
Research on a Feature Selection Method for Plant Pentatricopeptide Repeat (PPR) Proteins Based on an Ensemble of Multiple Classifiers

Posted on: 2022-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: H X Wang
GTID: 2480306311953169
Subject: Forestry Information Engineering
Abstract/Summary:
PPR (pentatricopeptide repeat) proteins are nuclear-encoded proteins that contain a number of tandem repeat units. They are structurally similar to tetratricopeptide repeat (TPR) proteins, hence the name PPR. The PPR family is one of the largest protein families in terrestrial plants. Most PPR proteins are located in plant mitochondria or chloroplasts, and they are common in eukaryotes, particularly terrestrial plants. There are about 450 PPR genes in Arabidopsis thaliana and about 430 in rice, and they play an important role in plant growth and development.

Plant growth is affected by many unfavorable external factors. Analyzing PPR proteins can help identify the factors that affect plant growth, and protein synthesis technology may then help plants overcome these unfavorable factors. The problem addressed in this thesis is how to improve data mining methods for the large volume of existing plant data, so that effective features can be extracted from feature vectors that are high-dimensional, highly redundant, and drawn from small samples.

Starting from the extraction of feature vectors from protein sequences, the thesis works through the experimental steps of data mining and proposes a feature selection algorithm that integrates multiple base classifiers and, at each iteration of model building, selects the classifier with the highest feature importance score. To verify the performance of the proposed algorithm, the main work is as follows:

Firstly, the research background, significance, and current state of feature selection and classifier integration are briefly introduced, and the feature extraction method, feature selection algorithm, and classifier selection algorithm are described in detail. The thesis analyzes why ensemble classifiers are effective, summarizes several classical ensemble learning models and the ensemble classification feature selection model, resamples, trains on, and scores the training set to obtain importance scores for all features, and then selects the optimal features for verification analysis.

Secondly, experimental results on a simulated data set show that the multi-classifier ensemble feature selection algorithm is more accurate than the base-classifier ensemble feature selection method.

Finally, the features extracted from the sequences are used to identify PPR proteins. Based on the proposed algorithm, a feature selection tool for biological protein sequences was developed, in which users can switch between different classifiers for feature selection. The selected features can then be analyzed qualitatively using ROC curves, projection heat maps, confusion matrices, and similar views.
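
The resample, train, score, and select loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the two base learners (`fit_centroid`, `fit_stump`), the use of out-of-bag accuracy to pick the best learner in each round, the function name `ensemble_feature_ranking`, and the synthetic data are all hypothetical choices made here so the example stays self-contained in the Python standard library.

```python
import random

def normalise(scores):
    """Scale non-negative scores to sum to 1 (uniform if all are zero)."""
    total = sum(scores)
    n = len(scores)
    return [s / total for s in scores] if total > 0 else [1 / n] * n

def fit_centroid(X, y):
    """Hypothetical base learner 1: nearest class centroid.
    Feature importance = absolute difference between the class means."""
    n_feat = len(X[0])
    mu = {}
    for c in (0, 1):
        rows = [r for r, t in zip(X, y) if t == c]
        mu[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
    def predict(row):
        dist = {c: sum((row[j] - mu[c][j]) ** 2 for j in range(n_feat)) for c in (0, 1)}
        return min(dist, key=dist.get)
    return predict, normalise([abs(mu[1][j] - mu[0][j]) for j in range(n_feat)])

def fit_stump(X, y):
    """Hypothetical base learner 2: single-feature threshold rule.
    Feature importance = best training accuracy above chance (0.5)."""
    best, gains = (0.0, 0, 0.0, 1), []
    for j in range(len(X[0])):
        best_j = 0.0
        for thr in sorted({r[j] for r in X}):
            acc = sum((r[j] > thr) == (t == 1) for r, t in zip(X, y)) / len(X)
            # sign=1 predicts class 1 above the threshold, sign=0 below it
            acc, sign = max((acc, 1), (1 - acc, 0))
            best_j = max(best_j, acc)
            if acc > best[0]:
                best = (acc, j, thr, sign)
        gains.append(max(best_j - 0.5, 0.0))
    _, j, thr, sign = best
    def predict(row):
        return int((row[j] > thr) == (sign == 1))
    return predict, normalise(gains)

def ensemble_feature_ranking(X, y, learners, n_rounds=30, seed=0):
    """Accumulate, over bootstrap rounds, the importances of whichever
    learner scores best on the out-of-bag samples (an assumed criterion)."""
    rng = random.Random(seed)
    n, totals = len(X), [0.0] * len(X[0])
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap resample
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]  # held-out samples
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if not oob or len(set(yb)) < 2:
            continue                                    # skip degenerate resamples
        best_acc, best_imp = -1.0, None
        for fit in learners:
            predict, imp = fit(Xb, yb)
            acc = sum(predict(X[i]) == y[i] for i in oob) / len(oob)
            if acc > best_acc:
                best_acc, best_imp = acc, imp
        totals = [t + w for t, w in zip(totals, best_imp)]
    ranking = sorted(range(len(totals)), key=lambda j: -totals[j])
    return ranking, totals

# Synthetic data (not from the thesis): feature 0 carries the label, 1 and 2 are noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(60)]
X = [[2.0 * t + data_rng.gauss(0, 0.3), data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for t in y]
ranking, totals = ensemble_feature_ranking(X, y, [fit_centroid, fit_stump])
print("features ranked by accumulated importance:", ranking)
```

Each learner's importances are normalised before they are accumulated, so learners whose raw scores sit on different scales contribute comparably. In practice the base learners would be real classifiers trained on feature vectors extracted from protein sequences.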
Keywords/Search Tags: PPR protein identification, ensemble classifier, resampling, feature importance