Software Development Of Protein Succinylation Prediction Based On Machine Learning

Posted on: 2020-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: K Liu
GTID: 2370330590950389
Subject: Software engineering
Abstract/Summary:
Lysine succinylation has been shown to be ubiquitous in prokaryotes and eukaryotes, especially in many metabolic enzymes of central and intermediate metabolism. Succinylation of protein lysine sites is widely involved in cell differentiation, cell metabolism, and other important physiological activities, and it plays an important role in central metabolic pathways and in physiological and pathological processes, which means that it is related to the occurrence of many diseases. Determining which lysine residues in a protein sequence are succinylation sites is therefore very important for the study of physiological characteristics and the development of related drugs. Identifying protein succinylation sites experimentally often takes a great deal of effort and time, which seriously slows research in this field. To address this problem, this thesis develops a computer software platform as a prediction tool for protein succinylation. The main content of the thesis includes:

(1) Feature extraction, a critical step. Each sample is an amino acid sequence written as letters. This thesis treats each sample as a text and each amino acid letter as a word, and uses TF-IDF to convert the letter information into numerical information, which completes feature construction and yields the model features.

(2) Handling class imbalance. Because the ratio of positive to negative samples is extremely unbalanced, the imbalance must be resolved before training the selected model. Based on common solutions, the characteristics of the data set itself, and experimentation, the SMOTE algorithm is used to oversample the positive samples so that the positive and negative sample sets are balanced.

(3) Model selection. Naive Bayes, Logistic Regression, SVM, Random Forest, Gradient Boosting Machine, and AdaBoost are analyzed and compared, and Random Forest and Gradient Boosting Machine are found to predict better on this data set. The two classifiers are then compared after parameter tuning. The random forest model reaches an average AUC of 0.921 in five-fold cross-validation and also performs well on the independent test set, so random forest is chosen as the machine learning prediction method in this thesis.

(4) Software development. The computer software platform is developed in Java to produce the final prediction tool.
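The feature construction and oversampling steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names `tfidf_features` and `smote`, the smoothed IDF formula, the neighbour count `k`, and the toy peptide strings are all hypothetical, since the abstract does not specify them. Classifier training and the Java platform are not shown.

```python
import math
import random
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters

def tfidf_features(sequences, alphabet=ALPHABET):
    """Treat each non-empty sequence as a document and each letter as a word."""
    n_docs = len(sequences)
    # Document frequency: in how many sequences each letter occurs.
    df = Counter()
    for seq in sequences:
        df.update(set(seq))
    features = []
    for seq in sequences:
        counts = Counter(seq)
        row = []
        for letter in alphabet:
            tf = counts[letter] / len(seq)
            # Smoothed IDF; an assumption, the abstract does not name a variant.
            idf = math.log((1 + n_docs) / (1 + df[letter])) + 1
            row.append(tf * idf)
        features.append(row)
    return features

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy data (hypothetical peptides, not taken from the thesis data set).
positives = ["AKKLE", "GKKSA", "MKKTE"]
negatives = ["LLKAV", "TTKGG", "PPKDD", "VVKII", "SSKNN", "QQKRR"]

X = tfidf_features(positives + negatives)
X_pos, X_neg = X[:len(positives)], X[len(positives):]

# Oversample the positive class until both classes have the same size.
X_pos_balanced = X_pos + smote(X_pos, n_new=len(X_neg) - len(X_pos))
print(len(X_pos_balanced), len(X_neg))
```

Each synthetic sample lies on the line segment between two real positive samples, so oversampling adds variety without copying existing rows verbatim. In practice the oversampling would be applied only to the training folds, so that the cross-validation and independent test results are not measured on synthetic data.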
Keywords/Search Tags: Lysine succinylation, TF-IDF, SMOTE, machine learning, random forest algorithm, prediction tool