
Research on Protein Function Prediction Based on Machine Learning and Multi-source Data

Posted on: 2022-03-03 | Degree: Master | Type: Thesis
Country: China | Candidate: Y X Shen | Full Text: PDF
GTID: 2480306536996679 | Subject: Computer technology
Abstract/Summary:
In recent years, the rapid development of high-throughput experimental methods has led to the discovery of a large number of new proteins, and the gap between the number of known proteins and the number with functional annotations keeps widening. Protein function prediction has therefore become a core problem in molecular biology. Traditional protein function prediction methods are time-consuming and expensive, and methods that rely on a single data source capture only incomplete feature information. Choosing a suitable machine learning method and building an effective model that fuses several kinds of biological data is therefore of great significance for predicting protein function. Because predictive ability is limited when only a single data source is used, this thesis uses machine learning methods to extract features from multiple data sources and studies the protein function prediction problem.

First, to address the problem that network structure alone cannot fully describe a protein, a protein function prediction method based on SVM and multi-source data is proposed. Two information sources, the protein-protein interaction (PPI) network and the protein sequence, are selected to extract protein features from different angles, and a result fusion strategy is used for classification. A deep autoencoder fuses multiple heterogeneous PPI networks to learn node features, and a support vector machine (SVM) performs classification on those features. A position-specific scoring matrix (PSSM) and Gaussian kernel similarity are then used to build a sequence similarity network, and the maximum similarity probability is calculated to obtain the category probability of each sequence vector. Finally, the two types of probability vectors are merged and an SVM performs the final classification. Combining multiple information sources effectively improves the accuracy of protein function annotation.

Second, to address the sparsity of protein network data, a protein function prediction method based on network and node attributes is proposed. It jointly considers topological structure, the sequence network, and attribute features to achieve accurate node classification. An autoencoder learns a representation of the PPI network, and a variational graph autoencoder learns representations of the sequence network and node attributes; the three sets of features are combined and an SVM is used for classification. Fusing features from different networks and node attributes yields a more effective representation of protein information.

Finally, an experimental platform was designed, the methods were verified on a real data set, and the performance of the prediction models was evaluated. Multiple sets of comparative experiments demonstrate the effectiveness of the protein function prediction methods based on machine learning and multi-source data.
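
The sequence branch and the result fusion step of the first method can be illustrated with a small sketch. It is not the thesis implementation: the function names, the `sigma` bandwidth, the toy two-dimensional vectors standing in for PSSM-derived features, the reading of "maximum similarity probability" as a per-class maximum followed by normalization, and fusion by concatenation are all assumptions made for illustration.

```python
import math


def gaussian_kernel_similarity(u, v, sigma=1.0):
    """Gaussian (RBF) similarity between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def similarity_class_scores(query, labeled, n_classes, sigma=1.0):
    """Assumed reading of 'maximum similarity probability': for each class,
    keep the highest similarity between the query and any labeled protein
    of that class, then normalize the scores so they sum to 1."""
    best = [0.0] * n_classes
    for features, label in labeled:
        sim = gaussian_kernel_similarity(query, features, sigma)
        best[label] = max(best[label], sim)
    total = sum(best)
    if total == 0.0:
        # No usable similarity: fall back to a uniform distribution.
        return [1.0 / n_classes] * n_classes
    return [b / total for b in best]


def fuse(p_network, p_sequence):
    """Assumed result fusion: concatenate the two probability vectors so a
    final classifier (an SVM in the thesis) can be trained on them."""
    return list(p_network) + list(p_sequence)


# Toy data: (feature vector, class label) pairs for two functional classes.
labeled = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
p_sequence = similarity_class_scores([0.1, 0.0], labeled, n_classes=2)

# Hypothetical class probabilities from the PPI-network branch.
p_network = [0.7, 0.3]
fused = fuse(p_network, p_sequence)
```

In this sketch the fused vector has one entry per class from each branch; in a full pipeline it would be the input to the final SVM, for which any standard SVM implementation could be used.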
Keywords: Protein function prediction, Machine learning, Multi-source data, Network representation, SVM