
Using Multi-label Learning Methods To Study Protein Subcellular Localization Prediction

Posted on: 2017-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhao
GTID: 2310330488468641
Subject: Computer Science and Technology
Abstract/Summary:
Proteins are the main executors of gene function, so studying protein function is a central task in genomics research. Each subcellular compartment provides the environment in which a protein performs its specific function. Only when a protein is transported to its correct subcellular compartment can it play its proper role in supporting the healthy and orderly progress of life activities; otherwise, it can cause functional disorders in the body and may even threaten life and health. The prediction of protein subcellular localization is therefore a basis for the study of protein function, and it is also important for understanding the pathogenesis of some diseases and for developing new drugs.

With the development of bioinformatics and genomics, the amount of experimentally obtained protein data has grown geometrically, and research once carried out by traditional experimental means has shifted to bioinformatics methods that can process massive data. Moreover, a large body of experimental data shows that more than 30% of proteins can reside in multiple subcellular locations or move between multiple subcellular compartments, so research on protein subcellular localization prediction has shifted from single-location (singleplex) proteins to multi-location (multiplex) proteins.
Using bioinformatics methods to predict multiplex protein subcellular localization has therefore become an active research direction.

Such prediction is usually divided into four steps. The first step is constructing effective multiplex protein datasets. The second is extracting comprehensive and effective features from the datasets. The third is choosing a classifier: because multiplex protein subcellular localization prediction is a typical multi-label learning problem, selecting a suitable multi-label classification algorithm is a key step. The last step is evaluating the prediction algorithm on the basis of its prediction results.

The key steps of multiplex protein subcellular localization prediction are the selection of feature extraction methods and of classification algorithms for the datasets. There are many feature extraction methods, including those based on sequence information and those based on annotation information. This thesis uses sequence-based feature extraction methods, including the amino acid composition model, the pseudo amino acid composition model, the composition model based on the physicochemical properties of amino acids, entropy density, autocorrelation coefficient coding, and the position vector composition model. Because each feature extraction method has its limitations, this thesis combines several feature extraction methods and compares the results in order to extract more comprehensive and effective features.

Because the task is a typical multi-label classification problem, many multi-label classification algorithms have emerged as problems of this kind have arisen.
Commonly used multi-label classification algorithms include the multi-label k-nearest neighbor algorithm (ML-kNN), the back-propagation neural network multi-label algorithm (BP-MLL), the multi-label support vector machine algorithm (Rank-SVM), multi-label decision tree algorithms, and the LEAD algorithm. This thesis introduces these algorithms and applies ML-kNN to classify and predict the datasets, achieving relatively high prediction precision.
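
To make the feature extraction and classification steps concrete, the following is a minimal sketch in Python, not the thesis's actual implementation. It computes only the simplest of the listed features, amino acid composition, and feeds it to a small ML-kNN written from the standard formulation (per-label prior, neighbor-count likelihoods with Laplace smoothing, maximum a posteriori decision). The function names, the value k=3, the toy sequences, and the two location labels are all assumptions made up for illustration; they are not data or settings from the thesis.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20 residue frequencies of a protein sequence."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid dividing by zero on an empty sequence
    return [c / total for c in counts]


def neighbor_votes(X, Y, x, k, exclude=None):
    """Count, per label, how many of the k nearest training points carry it."""
    order = sorted((i for i in range(len(X)) if i != exclude),
                   key=lambda i: math.dist(X[i], x))
    return [sum(Y[i][l] for i in order[:k]) for l in range(len(Y[0]))]


def fit_mlknn(X, Y, k=3, s=1.0):
    """Estimate ML-kNN priors and likelihoods with Laplace smoothing s."""
    n, q = len(Y), len(Y[0])
    prior = [(s + sum(row[l] for row in Y)) / (2 * s + n) for l in range(q)]
    # c1[l][j]: training points WITH label l that have exactly j neighbors
    # carrying label l; c0[l][j] is the same count for points WITHOUT label l.
    c1 = [[0] * (k + 1) for _ in range(q)]
    c0 = [[0] * (k + 1) for _ in range(q)]
    for i in range(n):
        votes = neighbor_votes(X, Y, X[i], k, exclude=i)  # leave-one-out
        for l in range(q):
            (c1 if Y[i][l] else c0)[l][votes[l]] += 1
    like1 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c1]
    like0 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c0]
    return {"X": X, "Y": Y, "k": k,
            "prior": prior, "like1": like1, "like0": like0}


def predict_mlknn(model, x):
    """Assign each label whose posterior for 'present' beats 'absent'."""
    votes = neighbor_votes(model["X"], model["Y"], x, model["k"])
    labels = []
    for l, v in enumerate(votes):
        present = model["prior"][l] * model["like1"][l][v]
        absent = (1 - model["prior"][l]) * model["like0"][l][v]
        labels.append(1 if present > absent else 0)
    return labels


# Made-up toy data (NOT from the thesis); label columns = [nucleus, cytoplasm].
train_seqs = ["KKRRKKRRKR", "KRKRKKKRRK", "RRKRKRRKRK", "KKRKRRKRKK",
              "LLAALLAALA", "LALALLLAAL", "AALALAALAL", "LLALAALALL"]
train_labels = [[1, 1]] * 4 + [[0, 1]] * 4  # first four are multi-location

X_train = [amino_acid_composition(seq) for seq in train_seqs]
model = fit_mlknn(X_train, train_labels, k=3)

queries = ["KRKRKRKRKK", "ALALALALLL"]
predictions = [predict_mlknn(model, amino_acid_composition(seq))
               for seq in queries]
print(predictions)  # [[1, 1], [0, 1]]
```

In this sketch a protein may receive more than one label, which is what distinguishes the multiplex setting from single-location prediction. Feature fusion could be approximated by concatenating several feature vectors per sequence before calling `fit_mlknn`, although the thesis does not specify that this is how its features were combined.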
Keywords/Search Tags: subcellular localization, feature extraction, feature fusion, multi-label learning, multi-label k-nearest neighbor algorithm