| Proteins are essential components of biological cells and play a significant role in many biological activities. In a biological environment, a protein must interact with ligands to carry out its specific functions; accurately identifying protein-ligand binding sites is therefore important for understanding protein function, disease mechanisms, and new drug development. According to the information used, methods for predicting protein-ligand binding sites fall into two classes: those based on sequence information and those based on structural information. Sequence-based methods use only features extracted from protein sequences, combined with machine learning algorithms. Compared with structure-based methods, sequence-based methods benefit from a larger and more varied pool of available data, which gives them wider applicability. At present, sequence-based prediction faces two challenges. First, the labels in protein-ligand binding datasets are severely imbalanced: the number of binding residues is much smaller than the number of non-binding residues. Second, the classification performance of machine learning algorithms still needs to be improved. To address these problems, this thesis studies the prediction of protein-ligand binding sites from sequence information. The main contributions and innovations of this thesis are as follows: 1. A hybrid prediction method for protein-RNA binding sites. This thesis proposes a novel hybrid prediction method for protein-RNA binding sites that combines a sequence feature-based method with a sequence template-based method. The sequence feature-based method extracts three sequence features, namely amino acid physicochemical properties, evolutionary conservation, and the evolutionary co-variation score, and combines them with a random forest 
algorithm to obtain the initial classification probability of each target residue. To offset the biased classification results caused by imbalanced learning, a probability adjustment algorithm is proposed that takes into account the aggregation of binding sites along the protein sequence and effectively corrects wrongly predicted samples in the initial classification results. The sequence template-based method searches for homologous segments between the query sequence and a template sequence, and maps the binding sites in the template sequence onto the corresponding positions in the query sequence. Finally, the two prediction methods are combined according to their specific performance characteristics to construct the hybrid prediction method. 2. Prediction of protein-DNA binding sites based on neighboring residue correlations. As a basic component of protein sequences, each kind of amino acid has its own physicochemical properties. When a target residue and its neighboring residues form a sequence segment, its DNA-binding capability is affected by both its own physicochemical properties and those of its neighbors. This thesis proposes an N-stage probability adjustment algorithm based on neighboring residue correlations, which identifies additional binding sites in the query sequence that the classifier missed because of imbalanced learning. Consequently, the proposed algorithm effectively improves overall prediction performance. 3. Prediction of protein-ATP binding sites based on convolutional neural networks. Because ATP is a small molecule, the number of protein-ATP binding sites is much smaller than the number of binding sites between proteins and large molecules, which causes even more severe imbalance in the datasets. It is therefore necessary to develop a machine learning algorithm with stronger classification performance. This thesis proposes two classification frameworks named Residual-Inception-based predictor and 
Multi-Inception-based predictor, built on deep convolutional neural networks. The proposed frameworks use the properties of convolutional neural networks to extract deep representations of the sequence features. In addition, when the loss function is computed, a larger weight is assigned to samples in the minority class, so that the frameworks focus more on the classification accuracy of the minority class. Finally, the two classification frameworks are combined to further improve overall prediction performance. 4. Prediction of protein-ATP binding sites based on an ensemble of convolutional neural networks and the LightGBM algorithm. Building on the above research, we further consider the effect of feature differences on prediction performance and propose two classification frameworks that take separated features as inputs. The two frameworks, named Multi-IncepResNet-based predictor and Multi-Xception-based predictor, are constructed using the Inception module and the Xception module of convolutional neural networks, respectively. Moreover, the proposed frameworks are ensembled with the LightGBM algorithm to achieve better performance by increasing the diversity of the prediction methods. In summary, based on protein sequence information, we have proposed four prediction methods for protein-RNA, protein-DNA, and protein-ATP binding sites. To address the difficulties in current research, we propose probability adjustment algorithms with correction capability and construct convolutional neural network frameworks with better classification performance. The experimental results demonstrate that the proposed methods achieve better performance on the evaluation criteria. The contents of this thesis provide a reference for predicting binding sites between proteins and other ligands, as well as for other prediction problems in bioinformatics. |
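
The minority-class weighting mentioned for the convolutional frameworks can be illustrated with a weighted binary cross-entropy. The following is only a minimal sketch under stated assumptions, not the implementation used in the thesis: the abstract does not give the exact loss or the weight values, so the function names `weighted_bce` and `balanced_pos_weight`, the negative-to-positive ratio heuristic for the weight, and the toy labels and probabilities are all hypothetical.

```python
import math


def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy in which each binding residue (label 1)
    contributes pos_weight times its usual loss (assumed form)."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


def balanced_pos_weight(labels):
    """Hypothetical heuristic: ratio of non-binding to binding residues."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos if n_pos else 1.0


# Toy sequence (made-up values): one binding residue among five.
labels = [0, 0, 1, 0, 0]
probs = [0.1, 0.2, 0.3, 0.1, 0.05]

w = balanced_pos_weight(labels)
plain = weighted_bce(probs, labels, 1.0)
weighted = weighted_bce(probs, labels, w)
print(f"pos_weight={w:.1f} plain={plain:.3f} weighted={weighted:.3f}")
```

With the weight applied, the under-predicted binding residue dominates the loss, so a model trained on it is penalized more heavily for missing binding sites than for false alarms on the majority class.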