Protein Subcellular Localization Based On Feature Selection And Cost-Sensitive Learning

Posted on: 2019-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: L Cheng
Full Text: PDF
GTID: 2370330548473458
Subject: Computer technology
Abstract/Summary:
In general, a protein classification problem involves the following steps: (1) building a reasonable protein data set; (2) converting protein information into feature vectors with a feature description method; (3) reducing dimensionality, because the original data set is high-dimensional; (4) establishing a classification model for protein sorting; and (5) measuring the classification effect with a test method and evaluation indicators. How to improve the accuracy of protein classification while reducing memory requirements has long been one of the most important issues for researchers. Feature engineering and the classification algorithm are the two key technologies. The feature representation determines the upper limit of the classification effect, and the model and algorithm can only approach that limit as closely as possible. Therefore, taking the prediction of protein subcellular location as its basis, this thesis studies feature representation and classification models for protein subcellular location. The main work and innovations are as follows:

1. A method that combines feature selection with weighted fusion is proposed to filter the data features, so as to obtain an optimal feature set and reduce the data dimension. Biological data are large in volume and high in dimension, which makes computation complicated and time-consuming, so the dimension of the data is reduced first. This thesis proposes the SVM-Logistic-RFE algorithm. It introduces feature selection, which does not change the original feature values: it eliminates redundant and irrelevant features and keeps only the most useful ones. Recursive feature elimination is combined with a support vector machine (SVM) and with logistic regression to filter the original features; each combination yields its own optimal feature subset, and the two subsets are merged by weighted fusion into a new optimal feature set. Finally, the k-nearest-neighbor algorithm is used for classification. Experiments show that: (1) the classification effect improves after feature selection; (2) classification after fusing the two feature selection results is better than classification after either feature selection alone.

2. To address class imbalance in protein data, this thesis proposes a naive Bayes and decision tree algorithm based on cost-sensitive learning (the NBDT-cs algorithm). Class imbalance is rarely considered in traditional protein classification. This thesis introduces the concept of cost-sensitive learning and uses cost gain as the attribute selection criterion of the decision tree; a naive Bayes algorithm with cost expectation is then applied at the leaf nodes of the decision tree. The resulting NBDT-cs algorithm can effectively handle class imbalance. Experimental results show that: (1) the NBDT-cs algorithm is more effective than the naive Bayes algorithm or the decision tree algorithm alone, and slightly better than the k-nearest-neighbor classifier; (2) the classification accuracy of the minority classes can be improved without reducing the overall classification accuracy...
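
The two ideas in the abstract, weighted fusion of two feature selectors and a decision rule based on cost expectation, can be illustrated with a minimal sketch. The abstract does not give the fusion formula, the weights, or the cost matrix, so everything below is an assumption made for illustration: the function names `fuse_feature_scores` and `min_expected_cost_class`, the weighted-sum fusion rule, and the example numbers are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence


def fuse_feature_scores(
    svm_scores: Dict[str, float],
    logistic_scores: Dict[str, float],
    svm_weight: float = 0.5,
    top_k: int = 2,
) -> List[str]:
    """Return the top_k features by a weighted sum of two selectors' scores.

    Assumption: scores are normalised to [0, 1], higher meaning more useful.
    A feature missing from one selector contributes 0 from that selector.
    """
    features = set(svm_scores) | set(logistic_scores)
    fused = {
        name: svm_weight * svm_scores.get(name, 0.0)
        + (1.0 - svm_weight) * logistic_scores.get(name, 0.0)
        for name in features
    }
    # Sort by descending fused score; break ties by name for determinism.
    ranked = sorted(fused, key=lambda name: (-fused[name], name))
    return ranked[:top_k]


def min_expected_cost_class(
    posteriors: Sequence[float],
    cost: Sequence[Sequence[float]],
) -> int:
    """Pick the predicted class with the lowest expected misclassification cost.

    posteriors[i] is P(true class = i | x), for example from naive Bayes.
    cost[i][j] is the (assumed) cost of predicting class j when the true
    class is i; the diagonal is normally 0.
    """
    n_classes = len(posteriors)
    expected = [
        sum(posteriors[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=lambda j: expected[j])
```

For example, with posteriors `[0.7, 0.3]` and a cost matrix `[[0, 1], [5, 0]]` that makes missing the minority class (index 1) five times as expensive, predicting class 0 has expected cost 1.5 and predicting class 1 has expected cost 0.7, so the rule returns class 1 even though class 0 is more probable. With equal costs it returns class 0, the usual maximum-posterior decision.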
Keywords/Search Tags: Feature selection fusion, Imbalance problem, Cost-sensitive learning, The nuclear cell localization, Gram type bacteria