
Research On Dimensionality Reduction Algorithm And Unbalance Problem In Membrane Protein Type Prediction

Posted on: 2020-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Guo
Full Text: PDF
GTID: 2370330575489315
Subject: Computer application technology
Abstract/Summary:
Knowing the type of a membrane protein is important for understanding its structure and function. In the post-genome era, traditional methods that determine membrane protein types through biological experiments are no longer practical. Machine learning-based methods have become an alternative to these experiments because of their high efficiency and low cost. Given these advantages, this thesis studies machine learning methods for membrane protein type prediction in depth, covering the following aspects:

1. Amino acid composition information, physicochemical information and evolutionary information are the three basic kinds of information carried by protein sequences. Based on them, this thesis constructs a rich and effective set of feature expression methods: local amino acid composition (LAAC), local dipeptide composition (LDC), tripeptide composition (TC), sum of physicochemical indices (SPPI), autocorrelation function (ACF), reduced position-specific scoring matrix (RPSSM), evolutionary difference position-specific scoring matrix (EDP) and pseudo position-specific scoring matrix (PsePSSM). Among them, SPPI is a new feature expression method based on the AAindex database.

2. Feature expression raises two problems: high feature dimensionality and feature heterogeneity. For the high-dimensionality problem, this thesis proposes a two-stage feature selection algorithm (MIC-GA) based on the maximal information coefficient and a genetic algorithm. MIC-GA simultaneously obtains the most effective feature subset for classification and the corresponding optimal classifier parameters. The experimental results confirm that MIC-GA removes redundant features and improves classifier performance. For the feature heterogeneity problem, this thesis transforms it into a classifier heterogeneity problem. The Stacking ensemble method handles classifier heterogeneity well and thereby indirectly solves the feature heterogeneity problem.

3. Membrane protein datasets are often severely imbalanced, which leads to low prediction accuracy on minority-class samples. In this thesis, the data are pre-processed before training by combining SMOTE oversampling with Tomek link undersampling. Because SMOTE generates a large amount of noisy data when the data dimension is high, the dimension of the data is reduced before resampling with FReliefF (Fuzzy-ReliefF), a feature selection algorithm proposed here that improves the original ReliefF algorithm using fuzzy membership degrees. The experimental results illustrate the effectiveness of the method.
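
The resampling step described in item 3 can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`smote`, `tomek_links`, `resample`), the Euclidean distance, the neighbour count `k`, and the choice to drop only the majority-class member of each Tomek link are all assumptions made for illustration. The sketch first interpolates new minority samples between existing minority neighbours (the SMOTE idea), then removes majority samples that form mutual-nearest-neighbour pairs with a minority sample (Tomek links).

```python
import math
import random


def nearest(idx, points, candidates, k=1):
    """Return the k indices in `candidates` closest to points[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a random minority
    sample and one of its k nearest minority neighbours.
    Assumes the minority class has at least two samples."""
    rng = random.Random(seed)
    all_idx = range(len(minority))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, all_idx, k))
        gap = rng.random()  # position along the segment from sample i to neighbour j
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx)[0] for i in all_idx]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


def resample(X, y, minority_label, n_new, k=3, seed=0):
    """Oversample the minority class, then drop the majority-class member
    of every Tomek link (an illustrative cleaning policy)."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    X_all = list(X) + smote(minority, n_new, k, seed)
    y_all = list(y) + [minority_label] * n_new
    drop = set()
    for i, j in tomek_links(X_all, y_all):
        drop.add(i if y_all[i] != minority_label else j)
    kept = [n for n in range(len(X_all)) if n not in drop]
    return [X_all[n] for n in kept], [y_all[n] for n in kept]
```

In this sketch every synthetic point lies on a segment between two real minority samples, which is why noisy or irrelevant dimensions are carried straight into the new data; this is consistent with the abstract's choice to reduce dimensionality before resampling.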
Keywords/Search Tags:membrane protein type prediction, feature expression, feature selection algorithm, ensemble learning, data imbalance