
Research On Protein Classification Algorithm Based On Feature Engineering

Posted on: 2021-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: J L Li
Full Text: PDF
GTID: 2370330614950013
Subject: Software engineering
Abstract/Summary:
With the advent of the post-genomic era, proteomics has become one of the core areas of life science research. The object of study in proteomics is the protein, and its ultimate goal is to decipher protein structure and function. Protein classification, an important branch of proteomics, is a research hotspot in bioinformatics. This thesis focuses on protein classification using machine learning methods and addresses two protein classification problems. The specific research contents are as follows:

1. A robust and powerful computational model for the classification of sub-Golgi proteins is proposed. In this model, we extracted PseKNC, k-separated-bigrams-PSSM, and PsePSSM features to represent protein sequences. An AdaBoost classifier was used to remove the redundant information contained in the PseKNC feature encoding. The Random-SMOTE technique was adopted to balance the datasets, and models trained with Random-SMOTE performed much better than models trained without it. Finally, we used an SVM as the predictor. Comparing our method with previous work, we conclude that it is much more powerful, with accuracies of 96.5%, 96.5%, and 96.9% in jackknife cross-validation, independent testing, and 10-fold cross-validation, respectively.

2. An ensemble predictor integrating six base classifiers for recognizing T6SEs was constructed. First, we selected the most effective feature encoding (k-separated-bigrams-PSSM) from among various feature encodings. The results of models trained on different feature encodings show that PSSM-based features are more helpful than other features. We then demonstrated that SMOTE (Synthetic Minority Oversampling Technique) has a significant effect for most feature encodings. Next, we compared the performance of different single classifiers and found that SVC is the most effective single model. Finally, we showed that our ensemble classifier is the most effective and robust T6SE predictor available, both in 10-fold cross-validation and in independent testing. Our method is far ahead of other existing methods in terms of accuracy and specificity. Taking all of our results into account, we conclude that our method is the best available predictor for screening experimental targets of T6SEs.
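
The abstract credits SMOTE-style oversampling with a large part of the performance gain but does not describe how the synthetic samples are produced. The following is a minimal sketch of classic SMOTE interpolation in plain Python, not the thesis's implementation: the Random-SMOTE variant used for the sub-Golgi model differs in how it picks interpolation points, and the function names (`smote`, `balance`), the parameter `k`, and the toy two-dimensional data are all assumptions made for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, seed=0):
    """Oversample the minority class until it matches the rest in size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = max(0, (len(y) - len(minority)) - len(minority))
    new_points = smote(minority, n_needed, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data (hypothetical): six majority samples, two minority samples.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.1],
     [2.0, 2.0], [2.2, 2.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))
```

When a sketch like this is combined with cross-validation, the oversampling is normally applied to the training folds only, so that synthetic points derived from a test sample cannot leak into training.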
Keywords/Search Tags: sub-Golgi protein classification, T6SEs, ensemble learning, feature engineering, Random-SMOTE, SVM