Software Development Of Protein Succinylation Prediction Based On Machine Learning

Posted on: 2020-03-27

Degree: Master

Type: Thesis

Country: China

Candidate: K Liu

GTID: 2370330590950389

Subject: Software engineering

Abstract/Summary:

Lysine succinylation has been shown to be ubiquitous in prokaryotes and eukaryotes, especially in many enzymes of central and intermediate metabolism. Succinylation of protein lysine sites is widely involved in cell differentiation, cell metabolism, and other important physiological activities, and it plays an important role in central metabolic pathways and in pathological processes, which links it to the occurrence of many diseases. Determining which lysine residues in a protein sequence are succinylation sites is therefore very important for studying physiological characteristics and for developing related drugs. Identifying protein succinylation sites experimentally often takes a great deal of effort and time, which seriously slows research in this field. To address this problem, this thesis develops a software platform that serves as a prediction tool for protein succinylation. The main content of the thesis includes:

(1) Feature extraction, which is a critical step. Each sample is an amino acid sequence written as letters. The thesis treats each sample as a text and each amino acid letter as a word, and uses TF-IDF to convert the letter information into numerical features for the model.

(2) Handling class imbalance. Because the ratio of positive to negative samples is extremely unbalanced, the imbalance must be resolved before training and selecting a model, taking into account common solutions and the characteristics of the data set itself. After experimentation, the SMOTE algorithm is used to oversample the positive samples so that the positive and negative sample sets are balanced.

(3) Model selection. Analysis and comparison of the Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting Machine, and AdaBoost machine learning algorithms shows that Random Forest and Gradient Boosting Machine predict better on this data set. The two classifiers are then compared after parameter tuning: the random forest model achieves an average AUC of 0.921 in five-fold cross-validation and also performs well on the independent test set, so random forest is chosen as the machine learning prediction method in this thesis.

(4) Developing the software platform in Java to obtain the final prediction tool.
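As a rough illustration of steps (1) and (2), the sketch below builds TF-IDF features from peptide windows and then oversamples the positive class with a SMOTE-style interpolation. It is not the thesis implementation. The choice of Python, the function names `tfidf_features` and `smote`, the smoothed IDF variant, the neighbour count `k`, and the toy windows and labels are all assumptions made for the example; the abstract does not state which TF-IDF weighting or SMOTE settings were used.

```python
import math
import random
from collections import Counter

# The 20 standard amino acids; each letter plays the role of a "word".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tfidf_features(windows):
    """Treat each peptide window as a document and return one TF-IDF vector per window."""
    n_docs = len(windows)
    # Document frequency: in how many windows each residue occurs at least once.
    df = Counter()
    for w in windows:
        df.update(set(w))
    # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1.
    idf = {a: math.log((1 + n_docs) / (1 + df[a])) + 1 for a in AMINO_ACIDS}
    features = []
    for w in windows:
        counts = Counter(w)
        # Term frequency is the residue count divided by the window length.
        # Letters outside AMINO_ACIDS get no feature column.
        features.append([counts[a] / len(w) * idf[a] for a in AMINO_ACIDS])
    return features


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Toy data, invented for illustration: windows centred on a lysine (K),
# with label 1 for a succinylated site and 0 otherwise.
windows = ["AKKLS", "GKTAV", "MKKEL", "LLKDA", "PKRRS", "SSKGG"]
labels = [1, 0, 0, 0, 1, 0]

X = tfidf_features(windows)
pos = [x for x, y in zip(X, labels) if y == 1]
neg = [x for x, y in zip(X, labels) if y == 0]

# Oversample the positive class until both classes have the same size.
pos_balanced = pos + smote(pos, len(neg) - len(pos), k=1)
```

In practice, library implementations such as scikit-learn's `TfidfVectorizer` (with a character-level analyzer) and imbalanced-learn's `SMOTE` would likely replace these hand-written functions. Oversampling should be applied only to the training folds, so that synthetic samples do not leak into the data used to compute the cross-validated AUC.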

Keywords/Search Tags:

Lysine succinylation, TF-IDF, SMOTE, machine learning, random forest algorithm, prediction tool

Related items

1	Method Development For The Prediction Of Two Types Of Lysine Post-translational Modification Sites Based On Sequence Information
2	Design And Implementation Of Predicting Lysine Succinylation In Proteins By GBM
3	Computational Prediction And Analysis Of Lysine Post-translational Modification Sites Based On Machine Learning Algorithm
4	Proteins Lysine Modification: Database Construction And Bioinformatics Prediction
5	Research On Prediction Method Of Beach Bar Sand Reservoir Based On Machine Learning
6	Research On Prediction Of Phosphorylation Modification Sites Based On Machine Learning
7	Prediction Research Of Protein-Protein Interaction Based On Ensemble Of Support Vector Machine And Random Forest
8	Research On Prediction Method Of Total Organic Carbon In Shale Based On Machine Learning
9	Theoretical Estimation Of Intracellular Concentration Of Metabolites In Micro-organisms
10	Research On Geochemical Abnormity Identification Of Metric Learning And Random Forest