Research On Machine Learning Methods For Identification Of Proteins Binding Sites

Posted on:2023-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:C Yang

Full Text:PDF

GTID:2558307154474844

Subject:Engineering

Abstract/Summary:

Accurate prediction of protein binding sites is crucial for the study of drug structure and for protein functional annotation. Thousands of experimentally determined protein complex structures have been deposited in protein databases. However, experimental determination is time-consuming and expensive. Given the importance of protein-residue interactions and the shortcomings of experimental methods, there is an urgent need for effective computational methods that identify these interactions from sequence or structural information. Binding site prediction is a typical class-imbalance problem: the minority class (binding residues) is far smaller than the majority class (non-binding residues). Traditional machine learning algorithms are poorly suited to such data, and their results are severely biased toward the majority class. It is therefore necessary to develop machine learning prediction methods designed for this setting and to build an accurate, convenient and efficient prediction model.

In the study of protein-ribonucleic acid (RNA) binding site identification, this thesis constructs its own data set and extracts sequence and structural features of the proteins. Then, to account for the influence of adjacent residues on the target residue, a sliding window and a hexahedron are introduced to encode and combine the features. On this basis, to handle the unbalanced data set, a Granular Multi-Kernel Support Vector Machine with Repetitive Under-sampling (GMKSVM-RU) is proposed. To evaluate the model, different feature combinations, different kernel functions and previous methods were compared. The comparison shows that the proposed approach improves the Matthews Correlation Coefficient (MCC) by at least 1% on both the training set and the testing set.

In the study of protein-ligand binding site identification, the Discrete Cosine Transform (DCT) is used to compress the Position-Specific Scoring Matrix (PSSM). In addition, Graph Regularized k-local Hyperplane Distance Nearest Neighbor (GHKNN) is proposed by adding a graph regularization term and kernel learning to K-Nearest Neighbor (KNN), which maps the feature space from low dimensions to high dimensions. To evaluate model performance, single classifiers, ensemble classifiers and previous methods were compared. The comparison shows that the MCC of the proposed method increases by more than 2% on most data sets.

The results of this thesis show, first, that protein features are relatively rich, but that structural features give higher precision than sequence features. Second, GMKSVM-RU can effectively address sample imbalance through repeated under-sampling. Finally, introducing the graph regularization term and kernel learning successfully maps features from a low-dimensional to a high-dimensional space, increases the correlation between features, and filters noise points more effectively.
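To illustrate the imbalance problem and the MCC metric mentioned in the abstract, the following is a minimal Python sketch, not the thesis's implementation. It shows that a predictor which labels every residue as non-binding can reach high accuracy while its MCC is zero, and it shows plain random under-sampling repeated over several rounds. The function names, the toy data (5 binding residues out of 100), and the number of rounds are all assumptions made for illustration; the granular multi-kernel SVM of GMKSVM-RU is not reproduced here.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = binding residue."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def repeated_undersample(labels, n_rounds, seed=0):
    """Build n_rounds balanced index subsets.

    Each subset keeps every minority (label 1) index and draws an equally
    sized random sample of majority (label 0) indices, redrawn each round.
    Assumes the positives are the minority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_rounds):
        sampled_neg = rng.sample(neg, len(pos))
        subsets.append(sorted(pos + sampled_neg))
    return subsets


# Hypothetical toy data: 5 binding residues among 100.
labels = [1] * 5 + [0] * 95
all_negative = [0] * 100  # a predictor biased entirely to the majority class

acc = accuracy(labels, all_negative)    # 0.95, looks good
score = mcc(labels, all_negative)       # 0.0, reveals the bias
subsets = repeated_undersample(labels, n_rounds=3)
```

In a repeated under-sampling scheme, one classifier would typically be trained on each balanced subset and the predictions combined; how GMKSVM-RU combines them is not stated in the abstract, so that step is left out.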

Keywords/Search Tags:

Site identification, Unbalanced sample, Binary classification, Sequence structure information

Related items

1	The SVM Algorithm And Its Application Based Data Preprocessing In The Kernel Space For Unbalanced Data
2	Test Signals Analysis And Study For Multivariable System Identification
3	Research On Unbalanced Classification Method Based On Reinforcement Learning
4	Research On Adaboost Improved Algorithm For Unbalanced Data
5	The Search Method Of Perfect Binary Complementary Sequence Pairs And Aperiodic Binary Sequence
6	Research And Application Of Unbalanced Data Classification Algorithm Based On Resampling
7	Research On SVM Classification Of Unbalanced Data And Its Application In Identify Poor Students In Colleges And Universities
8	Research On Binary Complementary Sequence And Binary Complementary Sequence Pairs
9	A Method Dealing With Sample Imbalances In Text Classification
10	Categories Of Unbalanced Data Integration Classification Research