Ensemble Classification Based Feature Selection From Gene Expression Profiles

Posted on: 2021-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Jiao
GTID: 2370330605964579
Subject: Forestry Information Engineering
Abstract/Summary:
With the rapid development of molecular biology and gene chip technology, a large volume of gene expression profile data has been generated. Such data are high-dimensional, have few samples, and contain a large amount of redundancy. Researchers analyze and mine these data according to these characteristics, most commonly with machine learning, pattern recognition, and biostatistics. At present, the most commonly used approach to differential analysis of gene microarray data is feature selection, which addresses the poor classification accuracy caused by excessively high dimensionality. Feature selection aims to eliminate irrelevant features, screen out representative differentially expressed genes, and improve the performance of the learner while maintaining classification accuracy.

The random forest algorithm is widely used in biology because of its high classification accuracy, strong resistance to overfitting, small marginal effect, and ability to handle complex interactions. However, across different types of data distribution, random forest feature selection based on the feature importance score is not always accurate. This thesis therefore studies an ensemble feature selection model based on a linear classifier, a support vector machine, and k-nearest neighbor. The main work is as follows:

(1) A feature selection model is proposed that integrates a linear classifier, a support vector machine, and k-nearest neighbor. Each base classifier draws its training samples by bagging (bootstrap resampling) and its features by random selection, and each variable is evaluated by computing a feature importance score.

(2) Experimental results show that the accuracy of the algorithm is better than that of the other three methods. Qualitative and quantitative experiments show that the method is effective, although the results of the ensemble support vector machine experiment are not ideal.

(3) PPR proteins form one of the largest protein families in terrestrial plants, with more than 400 members in most species, and play a key role in plant growth and development. This thesis presents and analyzes the qualitative and quantitative results of the random forest feature selection algorithm in PPR protein identification experiments. The variable importance values computed by random forest are consistent with the classification accuracy obtained with the 188D and PAAC features, which indicates that random forest extracts the key factors from the PPR protein data.

(5) Finally, an application for feature selection on biological data is developed in Python. It lets users switch among four algorithms for feature selection on expression profile data and then analyze the ROC curve, projection heat map, true positive rate, false positive rate, and other key indicators of the selected features under a specified ensemble classification algorithm.
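
The bagging-plus-random-feature scheme described in the abstract can be illustrated with a small Python sketch. This is not the thesis implementation. The abstract does not say how the importance score is computed, so the sketch assumes one plausible rule: a feature's score is the mean out-of-bag accuracy of the base learners whose random feature subset contained it. A nearest-centroid classifier stands in for the linear base classifier, and the names `fit_centroids`, `predict`, `ensemble_feature_scores`, and `make_toy_data` are hypothetical.

```python
import random
from collections import defaultdict


def fit_centroids(X, y, feats):
    """Nearest-centroid stand-in for a linear base classifier:
    store the per-class mean of the selected features."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(feats))
        for j, f in enumerate(feats):
            acc[j] += row[f]
        counts[label] += 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict(centroids, row, feats):
    """Return the class whose centroid is closest (squared Euclidean)."""
    def dist(center):
        return sum((row[f] - center[j]) ** 2 for j, f in enumerate(feats))
    return min(centroids, key=lambda c: dist(centroids[c]))


def ensemble_feature_scores(X, y, n_estimators=200, n_sub_features=2, seed=0):
    """Score features with a bagged, random-subspace ensemble.

    Assumed scoring rule (not taken from the thesis): a feature's score is
    the mean out-of-bag accuracy of the learners that used it.
    """
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    totals = [0.0] * n_features
    uses = [0] * n_features
    for _ in range(n_estimators):
        # Bagging: draw a bootstrap sample of the rows, with replacement.
        boot = [rng.randrange(n_samples) for _ in range(n_samples)]
        in_bag = set(boot)
        oob = [i for i in range(n_samples) if i not in in_bag]
        if not oob:
            continue  # no out-of-bag rows to evaluate on
        # Random subspace: each base learner sees a random feature subset.
        feats = rng.sample(range(n_features), n_sub_features)
        model = fit_centroids([X[i] for i in boot], [y[i] for i in boot], feats)
        acc = sum(predict(model, X[i], feats) == y[i] for i in oob) / len(oob)
        for f in feats:
            totals[f] += acc
            uses[f] += 1
    return [t / u if u else 0.0 for t, u in zip(totals, uses)]


def make_toy_data(n_samples=40, n_features=6, seed=1):
    """Synthetic data: only feature 0 carries class information."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for idx, score in enumerate(ensemble_feature_scores(X, y)):
        print(f"feature {idx}: {score:.3f}")
```

On the toy data, the informative feature should receive the highest score because every learner that includes it classifies the out-of-bag rows well. Swapping in a support vector machine or k-nearest neighbor base learner would only require replacing `fit_centroids` and `predict`.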
Keywords/Search Tags: PPR proteins identification, random forest, resampling, ensemble linear classifier, feature importance