Research On Dimensionality Reduction Algorithm And Unbalance Problem In Membrane Protein Type Prediction

Posted on:2020-12-24

Degree:Master

Type:Thesis

Country:China

Candidate:L Guo

Full Text:PDF

GTID:2370330575489315

Subject:Computer application technology

Abstract/Summary:

The type of a membrane protein is important for understanding its structure and function. With the advent of the post-genome era, traditional methods that determine membrane protein types through biological experiments are no longer practical. Machine learning-based methods have become an alternative to traditional biological experiments because of their high efficiency and low cost. In view of these advantages, this thesis studies machine learning methods for membrane protein type prediction in depth, covering the following aspects:

1. Amino acid composition information, physicochemical information and evolutionary information are the three basic kinds of information carried by protein sequences. Based on this information, this thesis constructs a rich and effective set of feature expression methods: local amino acid composition (LAAC), local dipeptide composition (LDC), tripeptide composition (TC), sum of physicochemical indices (SPPI), autocorrelation function (ACF), reduced position-specific scoring matrix (RPSSM), evolutionary difference position-specific scoring matrix (EDP) and pseudo position-specific scoring matrix (PsePSSM). Among them, SPPI is a new feature expression method based on the AAindex database.

2. Two problems arise after feature expression: high feature dimensionality and feature heterogeneity. For the high-dimensional feature problem, this thesis proposes a two-stage feature selection algorithm (MIC-GA) based on the maximal information coefficient and a genetic algorithm. MIC-GA simultaneously obtains the feature subset that is most effective for classification and the corresponding optimal classifier parameters. The experimental results confirm the effectiveness of MIC-GA in removing redundant features and improving classifier performance. For the feature heterogeneity problem, this thesis transforms it into a classifier heterogeneity problem. The Stacking ensemble method handles classifier heterogeneity well and thereby indirectly solves the feature heterogeneity problem.

3. Membrane protein datasets are often severely imbalanced, which leads to low prediction accuracy on minority-class samples. In this thesis, the data are preprocessed before training by combining SMOTE oversampling with Tomek link undersampling. Because SMOTE generates a large amount of noisy data when the data dimensionality is high, a feature selection algorithm, FReliefF (Fuzzy-ReliefF), is proposed by improving the original ReliefF algorithm with fuzzy membership degrees, and it is applied before resampling to reduce the dimensionality of the data. The experimental results illustrate the effectiveness of the method.
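The resampling step in item 3 can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a two-class toy problem, Euclidean distance, and the common convention of removing both samples of each Tomek link. The function names (smote, remove_tomek_links, smote_tomek) and the toy data are hypothetical, chosen only for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def nearest_index(points, i):
    """Index of the nearest neighbour of points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: euclidean(points[i], points[j]))


def remove_tomek_links(X, y):
    """Drop every pair of mutual nearest neighbours with different labels.
    Assumption: both members of a link are removed."""
    nn = [nearest_index(X, i) for i in range(len(X))]
    drop = {i for i in range(len(X))
            if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class to parity, then clean Tomek links.
    Two-class simplification: every other label counts as the majority."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, rng)
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return remove_tomek_links(X_bal, y_bal)


# Toy data: six majority samples (label 0) and three minority samples (label 1).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
     [1.0, 1.0], [1.1, 0.9], [0.45, 0.15]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_res, y_res = smote_tomek(X, y, minority_label=1)
print("class counts after resampling:", y_res.count(0), y_res.count(1))
```

In this sketch every distance is computed in the full feature space, which is where the motivation for reducing dimensionality before resampling shows up: in high dimensions the nearest neighbours that SMOTE interpolates between become less meaningful, so the synthetic points are more likely to be noise.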

Keywords/Search Tags:

membrane protein type prediction, feature expression, feature selection algorithm, ensemble learning, data imbalance

Related items

1	The Classification Prediction Of High Dimensional Data Of Membrane Protein Based On Multi-feature Fusion
2	Feature Extraction And Learning Algorithm For Protein-ligand Binding Sites Prediction
3	Protein Subcellular Localization Based On Feature Selection And Cost-Sensitive Learning
4	MicroRNA Prediction Using SVM Based On Imbalance Dataset
5	Research On Ensemble Feature Selection Algorithm For Biomedical Data
6	Research On Feature Engineering And Feature Selection Algorithm Of Biogenetic Data Based On CNN
7	Prediction Of DNA-binding Proteins Based On Comprehensive Characteristics Of Protein Sequences And Ensemble Learning
8	Ensemble Feature Selection For Omic Data
9	Prediction Of Amidation Sites Based On Ensemble Learning
10	Study On Feature Selection And Classification Algorithm For Gene Expression Data