Feature selection procedure
One of the first steps in machine learning is the identification of informative features. In this project, apart from the amino acid sequence itself, additional features can be identified to boost prediction accuracy. For this purpose, the AAindex database was used.
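As a rough illustration of how an AAindex entry can become a sequence-level feature, the sketch below averages a per-residue scale over a sequence. It uses the Kyte-Doolittle hydropathy index (AAindex entry KYTJ820101) purely as an example; whether this particular entry is among the informative features is not stated here, and the helper name `mean_property` is hypothetical.

```python
# Kyte-Doolittle hydropathy index (AAindex entry KYTJ820101).
# Chosen only as an example scale; any AAindex entry has the same shape:
# one numeric value per standard amino acid.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_property(sequence, scale):
    """Average a per-residue scale over a sequence.

    Residues missing from the scale (e.g. 'X') are skipped.
    Returns 0.0 if no residue could be scored.
    """
    values = [scale[aa] for aa in sequence if aa in scale]
    return sum(values) / len(values) if values else 0.0
```

A feature computed this way yields one number per sequence, which can then be ranked against the target value (for example by correlation) to judge how informative it is. That ranking step is an assumption of this sketch, not a description of the procedure actually used.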
The most informative AAindex features for protein isoelectric point prediction (IPC_protein_75 dataset)
The most informative AAindex features for peptide isoelectric point prediction (IPC2_peptide_75 dataset)
The most informative AAindex features for pKa prediction (IPC2_pKa_75 dataset)
The available data for IPC_protein_100 datasets were highly limited. This was far below the level that could be used in deep learning; therefore, an augmentation technique was used (details of the augmentation scheme will be released after manuscript publication).
The amino acid sequences (one-hot encoded) plus the extra features are converted into vectors that can be used in supervised machine learning. A number of architectures were tested. The final model architecture consists of a mixture of dense layers and activation functions (ReLU, Softplus and Softsign). To avoid overfitting, 10-fold cross-validation and dropout were used.
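The pieces named above can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions, not the actual model: the padding length, layer sizes, dropout rate, weight initialisation and all function names are hypothetical, and no training loop is shown.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, extra_features, max_len=10):
    """One-hot encode a sequence (zero-padded or truncated to max_len)
    and append the extra numeric features to the flattened vector."""
    one_hot = np.zeros((max_len, len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence[:max_len]):
        one_hot[pos, AA_INDEX[aa]] = 1.0
    return np.concatenate([one_hot.ravel(), np.asarray(extra_features, dtype=float)])


def relu(x):
    return np.maximum(x, 0.0)


def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), numerically stable


def softsign(x):
    return x / (1.0 + np.abs(x))


def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def forward(x, hidden_layers, output_layer, rng=None, rate=0.2):
    """Dense hidden layers followed by a linear output.
    Dropout is applied only when an rng is passed (i.e. during training)."""
    for weights, bias, activation in hidden_layers:
        x = activation(x @ weights + bias)
        if rng is not None:
            x = dropout(x, rate, rng)
    weights, bias = output_layer
    return x @ weights + bias


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]


# Usage with made-up inputs: a 4-residue peptide and two extra features.
rng = np.random.default_rng(0)
x = encode("ACDK", extra_features=[0.5, -1.2], max_len=10)
sizes = [x.size, 32, 16, 8]              # hypothetical layer widths
activations = [relu, softplus, softsign]
hidden = [
    (rng.normal(0.0, 0.1, (m, n)), np.zeros(n), act)
    for m, n, act in zip(sizes[:-1], sizes[1:], activations)
]
output = (rng.normal(0.0, 0.1, (sizes[-1], 1)), np.zeros(1))
prediction = forward(x, hidden, output)  # untrained, so the value is meaningless
```

With 10-fold cross-validation, each sample is held out exactly once, so every model is evaluated on data it did not see during training. Dropout is switched off at prediction time by calling `forward` without an `rng`.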
In a nutshell, we used:
Or to be more exact:
where deep learning models look like: