Research On Machine Learning Methods For Identification Of Proteins Binding Sites

Posted on: 2023-01-15
Degree: Master
Type: Thesis
Country: China
Candidate: C Yang
Full Text: PDF
GTID: 2558307154474844
Subject: Engineering
Abstract/Summary:
Accurate prediction of protein binding sites is crucial for drug structure studies and protein functional annotation. Thousands of experimentally determined protein structural complexes have been deposited in protein databases. However, experimental determination is time-consuming and expensive. Given the importance of protein-residue interactions and these shortcomings, there is an urgent need for effective computational methods that identify protein-residue interactions from sequence or structural information. Binding site prediction is a typical class imbalance problem: the minority class (binding residues) is far smaller than the majority class (non-binding residues). Traditional machine learning algorithms are poorly suited to such data, and their predictions are severely biased toward the majority class. It is therefore necessary to develop a machine-learning-based prediction method and to design an accurate, convenient and efficient prediction model.

In the study of protein-Ribonucleic Acid (RNA) binding site identification, this thesis constructs its own data set and extracts sequence and structural features of the proteins. Then, to account for the influence of adjacent residues on the target residue, a sliding window and a hexahedron are introduced to encode and combine the features. On this basis, and to handle the imbalanced data set, a Granular Multi-Kernel Support Vector Machine with Repetitive Under-sampling (GMKSVM-RU) is proposed. To evaluate the model, different feature combinations, different kernel functions and previous methods were compared. The comparison shows that the proposed approach improves the Matthews Correlation Coefficient (MCC) by at least 1% on both the training set and the testing set.

In the study of protein-ligand binding site identification, the Discrete Cosine Transform (DCT) is used to compress the Position-Specific Scoring Matrix (PSSM). In addition, Graph Regularized k-local Hyperplane Distance Nearest Neighbor (GHKNN) is proposed by adding a graph regularization term and kernel learning to K-Nearest Neighbor (KNN), which maps the feature space from low dimensions to high dimensions. To evaluate the model, single classifiers, ensemble classifiers and previous methods were compared. The comparison shows that the MCC of the proposed method increases by more than 2% on most data sets.

The results of this thesis show that, although protein features are relatively rich, structural features yield higher precision than sequence features. Second, GMKSVM-RU can effectively address sample imbalance through repeated under-sampling. Finally, introducing the graph regularization term and kernel learning successfully maps features from a low-dimensional to a high-dimensional space, increases the correlation between features, and filters noise points more effectively.
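
The two general ideas the abstract relies on, evaluating with MCC on imbalanced data and drawing repeated under-samples of the majority class, can be illustrated with a minimal Python sketch. This is not the thesis's GMKSVM-RU implementation, whose details are not given here. The function names `mcc` and `repeated_undersample`, the 10:90 class ratio, and the number of rounds are all assumptions made for illustration.

```python
import math
import random


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def repeated_undersample(labels, rounds, seed=0):
    """Return `rounds` balanced index lists (hypothetical helper).

    Each list holds every minority-class index (label 1) plus an equally
    sized random draw, without replacement, of majority-class indices (label 0).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return [pos + rng.sample(neg, len(pos)) for _ in range(rounds)]


# Toy data (assumed): 10 binding residues among 100 residues.
labels = [1] * 10 + [0] * 90

# A classifier that always predicts "non-binding" looks accurate,
# but its MCC exposes that it has learned nothing about the minority class.
accuracy_all_negative = 90 / 100
mcc_all_negative = mcc(tp=0, fp=0, tn=90, fn=10)

# Five balanced training subsets, one per (hypothetical) base classifier.
subsets = repeated_undersample(labels, rounds=5)
```

Each balanced subset could then train one base classifier, with the predictions combined afterwards. How the thesis combines them is not stated in this abstract.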
Keywords/Search Tags: Site identification, Unbalanced sample, Binary classification, Sequence structure information