Datasets used for deep learning
The full datasets were never used directly. First, the sequences were clustered to remove duplicates and, where multiple experimental values existed for a sequence, to average its isoelectric point. The clustered data were then split randomly into a test set (25%) and a training set (75%). The training sets were used for feature selection, model training, and (hyper)parameter optimisation with 10-fold cross-validation. The test sets were used only once, to assess the final performance of the models. The main datasets used in the study are identified in bold. For the sequences and experimental isoelectric points of individual datasets, see the Supplementary Data 1 file.
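The preparation steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: exact-sequence grouping stands in for the clustering step (whose tool and identity threshold are not stated here), and the function names, the fixed seed, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def deduplicate(records):
    """Merge identical sequences and average their isoelectric points.

    Assumption: exact string identity stands in for the clustering step.
    """
    grouped = defaultdict(list)
    for sequence, pi in records:
        grouped[sequence].append(pi)
    return [(sequence, mean(values)) for sequence, values in grouped.items()]


def split_test_train(records, test_fraction=0.25, seed=0):
    """Shuffle the records and return (test, train) with a 25%/75% split."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[:n_test], shuffled[n_test:]


def k_fold_indices(n_items, k=10):
    """Yield (training_indices, validation_indices) for k-fold cross-validation."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out, validation in enumerate(folds):
        training = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield training, validation


# Hypothetical toy records: (sequence, experimental isoelectric point).
records = [("ACDK", 4.0), ("ACDK", 5.0)]
records += [("PEP" + "K" * i, 5.0 + 0.1 * i) for i in range(1, 40)]

unique = deduplicate(records)                 # duplicates merged, pI averaged
test_set, train_set = split_test_train(unique)
folds = list(k_fold_indices(len(train_set)))  # used on the training set only
```

In this sketch the test set is produced once and never enters the cross-validation loop, which mirrors the rule that the test sets were used only for the final assessment.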
All datasets are available as a 7z archive.