Datasets used for deep learning
The full datasets were never used directly. First, the sequences were clustered to remove duplicates and, where multiple experimental values existed, to average the isoelectric point. The clustered data were then split randomly into training (75%) and test (25%) sets. The training sets were used for model training and (hyper)parameter optimisation with 10-fold cross-validation. The test sets were used only once, to assess the final performance of the models.
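The preprocessing described above can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used: exact-sequence matching stands in for the clustering step (whose method is not specified here), and the function names, the toy peptide records and the random seed are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def merge_duplicates(records):
    """Group records by identical sequence and average their pI values.

    Assumption: exact string matching is a stand-in for the clustering step.
    """
    grouped = defaultdict(list)
    for sequence, pi_value in records:
        grouped[sequence].append(pi_value)
    return [(seq, mean(values)) for seq, values in sorted(grouped.items())]


def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of the items and return (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    for fold in range(k):
        val_idx = indices[fold::k]  # every k-th item, starting at offset `fold`
        val_set = set(val_idx)
        train_idx = [i for i in indices if i not in val_set]
        yield train_idx, val_idx


# Hypothetical toy data: 40 distinct sequences plus one repeated measurement.
records = [("PEPTIDE" + "A" * i, 4.0 + 0.1 * i) for i in range(40)]
records.append(("PEPTIDE", 5.0))  # second measurement of the first sequence

merged = merge_duplicates(records)
train_set, test_set = split_train_test(merged)  # 75% / 25%
folds = list(k_fold_indices(len(train_set), k=10))
print(len(merged), len(train_set), len(test_set), len(folds))
```

Only `train_set` would be passed to the cross-validation loop; `test_set` is held back and evaluated once, after all model and hyperparameter choices are fixed.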
All datasets are available as a 7z archive.