Feature selection procedure
One of the first steps in machine learning is identifying informative features. In this project, features beyond the amino acid sequence itself can be added to boost prediction accuracy. For this purpose, the AAindex database was used.
The most informative AAindex features for protein isoelectric point prediction (IPC_protein_75 dataset)
The most informative AAindex features for peptide isoelectric point prediction (IPC2_peptide_75 dataset)
For brevity, the feature selection results for the individual pKa datasets (8 similar tables) are omitted.
The available data for the IPC_protein_100 datasets were highly limited, well below the level needed for deep learning; therefore, an augmentation technique was used (details of the augmentation scheme will be released after manuscript publication).
The amino acid sequences (one-hot encoded) plus the extra features are converted into vectors that can be used for supervised machine learning. A number of architectures were tested. The final model architecture consists of a mixture of convolutional and dense layers with ReLU, Softplus, and Softsign activation functions. To avoid overfitting, 10-fold cross-validation and dropout were used. Details of the model will be released after manuscript publication.
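The conversion of a sequence into an input vector can be illustrated with a minimal sketch. It is not the project's actual code: the function names (one_hot, mean_scale, encode), the padding length max_len, and the choice of the Kyte-Doolittle hydropathy scale (AAindex entry KYTJ820101) as the single extra feature are all assumptions made for illustration, and the features actually selected may differ.

```python
import numpy as np

# The 20 standard amino acids; the column order is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy scale (AAindex KYTJ820101), used here only as an
# example of an AAindex-style per-residue feature (assumption, not the
# project's confirmed feature set).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(sequence, max_len):
    """Return a (max_len, 20) matrix; unknown residues and padding rows stay zero."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


def mean_scale(sequence, scale):
    """Average a per-residue scale over the sequence, skipping unknown residues."""
    values = [scale[aa] for aa in sequence if aa in scale]
    return float(np.mean(values)) if values else 0.0


def encode(sequence, max_len=60):
    """Concatenate the flattened one-hot matrix with the extra feature(s)."""
    flat = one_hot(sequence, max_len).ravel()
    extra = np.array([mean_scale(sequence, KYTE_DOOLITTLE)], dtype=np.float32)
    return np.concatenate([flat, extra])
```

Calling encode("ACDK", max_len=10) would give a vector of length 10 * 20 + 1, which could then be fed to a supervised model. A convolutional layer would more likely consume the unflattened (max_len, 20) matrix, so the flattening here is only one possible design.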
In a nutshell, we used: