Protein Subcellular Localization Based On Feature Selection And Cost-Sensitive Learning

Posted on:2019-09-28

Degree:Master

Type:Thesis

Country:China

Candidate:L Cheng

Full Text:PDF

GTID:2370330548473458

Subject:Computer technology

Abstract/Summary:

Protein classification generally proceeds in five steps: (1) building a reasonable protein data set; (2) converting protein information into feature vectors with a feature description method; (3) reducing the dimensionality of the original, high-dimensional data set; (4) establishing a classification model for sorting proteins; and (5) measuring the classification performance with a test method and evaluation indicators. Improving classification accuracy while reducing memory demand has long been one of the most important concerns for researchers. Feature engineering and the classification algorithm are the two key technologies: the feature representation determines the upper limit of classification performance, and the model and algorithm can only approach that limit as closely as possible. This thesis therefore studies feature representation and classification models for predicting protein subcellular location. The main work and innovations are as follows:

1. A method of feature selection and weighted fusion is proposed to filter the data features, yielding an optimal feature set and reducing the data dimension. Biological data are large and high-dimensional, which makes computation complicated and time-consuming, so the data are first reduced in dimension. The thesis proposes the SVM-Logistic-RFE algorithm. Feature selection does not change the original feature values; it eliminates redundant and irrelevant features and keeps only the most useful ones. Recursive feature elimination (RFE) is combined with a support vector machine (SVM) and with logistic regression to filter the original features separately. Each combination yields its own optimal feature subset, and the two subsets are merged into a new optimal weighted-fusion feature set. Finally, the k-nearest neighbor algorithm is used for classification. Experiments show that (1) classification performance improves after feature selection, and (2) classification after fusing the two feature selection methods is better than with a single feature selection method.

2. To address the class imbalance of protein data, the thesis proposes a naive Bayes and decision tree algorithm based on cost-sensitive learning (the NBDT-cs algorithm). Class imbalance is rarely considered in traditional protein classification problems. The thesis introduces the concept of cost-sensitive learning, uses cost gain as the attribute selection criterion of the decision tree, and then applies a naive Bayes algorithm with cost expectation at the leaf nodes of the decision tree. The resulting NBDT-cs algorithm can effectively handle the imbalance between data categories. Experimental results show that (1) the NBDT-cs algorithm is more effective than the naive Bayes algorithm or the decision tree algorithm alone, and slightly better than the k-nearest neighbor classifier; and (2) the classification accuracy of minority classes can be improved without reducing the overall classification accuracy...
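
The cost-sensitive idea in point 2 can be illustrated with a minimal Python sketch. It is not the NBDT-cs algorithm itself, whose details the abstract does not give. It only shows the general minimum-expected-cost decision rule that a naive Bayes step "with cost expectation" plausibly relies on. The function name, class labels, posterior values, and cost values are all hypothetical.

```python
def min_expected_cost_class(posteriors, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    posteriors: dict mapping each class to P(class | sample),
                e.g. as estimated by a naive Bayes model.
    cost: dict mapping (true_class, predicted_class) to a non-negative cost.
    """
    def expected_cost(predicted):
        # Average the cost of this prediction over the possible true classes.
        return sum(p * cost[(true, predicted)] for true, p in posteriors.items())

    return min(posteriors, key=expected_cost)


# Hypothetical posteriors for one protein: the majority class looks more likely.
posteriors = {"majority": 0.7, "minority": 0.3}

# Equal costs reduce the rule to picking the most probable class.
uniform_cost = {
    ("majority", "majority"): 0.0, ("majority", "minority"): 1.0,
    ("minority", "majority"): 1.0, ("minority", "minority"): 0.0,
}

# Assumed costs: missing a minority-class protein is five times as costly.
skewed_cost = dict(uniform_cost)
skewed_cost[("minority", "majority")] = 5.0

print(min_expected_cost_class(posteriors, uniform_cost))  # majority
print(min_expected_cost_class(posteriors, skewed_cost))   # minority
```

With equal costs the rule picks the more probable class. Raising the assumed cost of missing the minority class changes the expected costs to 1.5 for predicting "majority" and 0.7 for predicting "minority", so the decision shifts toward the minority class. This is the general mechanism by which cost-sensitive learning counteracts class imbalance.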

Keywords/Search Tags:

Feature selection fusion, Imbalance problem, Cost sensitive learning, The nuclear cell localization, Gram type bacteria

Related items

1	Study On Feature Extraction And Prediction Algorithm For Subcellular Localization Of Gram-positive Bacterial Protein
2	Research On Dimensionality Reduction Algorithm And Unbalance Problem In Membrane Protein Type Prediction
3	A Research On Automatic Cell Counting Method In Fluorescence Microimaging Based On Deep Learning
4	Classification Of Non-classical Secreted Proteins Of Gram-positive Bacteria Based On Two-layer LightGBM-based Ensemble Model
5	Research On Protein Subcellular Localization Prediction Based On Evolutionary Information And Feature Fusion
6	Protein Sub-nuclear Localization Based On Feature Fusion And Dimension Reduction Algorithm
7	A Novel Approach To Product Quality Control In Industry Based On Ensemble Learning
8	MicroRNA Prediction Using SVM Based On Imbalance Dataset
9	A Multi-feature Fusion Algorithm For LncRNA Subcellular Localization Prediction Problem
10	Prediction Of Protein SUMO Modification Sites Based On Cost-sensitive Learning