The Application Of Integrated Learning Algorithm In DNA N4-Methylation And Replication Initiation Sites Identification

Posted on:2023-10-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Yao

Full Text:PDF

GTID:2530306911984709

Subject:Applied Mathematics

Abstract/Summary:

DNA N4-methylation and replication initiation sites are common and significant epigenetic mechanisms. Among them, N4-methylcytosine (4mC) is an important methylation modification that is widespread in prokaryotes. It plays a crucial role in regulating DNA replication, participating in the restriction-modification system, and protecting host DNA from damage. Accurate identification of 4mC sites is therefore important for understanding biological functions and treating genetic diseases. DNA replication is one of the most important life activities in cells. Although replication mechanisms differ between species, they share some common features, such as DNA replication initiation sites. Establishing a powerful model to predict DNA replication initiation sites is therefore of great significance for further understanding gene expression and regulation during cell division. Although researchers have constructed many prediction models to identify DNA N4-methylation sites and replication initiation sites, their reported prediction performance is not ideal. Given the rapid development of ensemble learning and its successful application in various fields, this work uses ensemble learning methods to build the identification models. The primary research results of this thesis are as follows:

(1) Gradient Boosting Decision Tree (GBDT) is used as a feature selection method to construct the prediction model for DNA N4-methylation sites. First, biological sequences are transformed into numerical vectors by multi-source feature representation methods based on sequence information, Ring-function-hydrogen-chemical properties, and DNA physicochemical properties. Then, for feature selection and classification, ensemble learning algorithms and other machine learning algorithms are compared experimentally. After extensive experiments, the ensemble learning algorithm GBDT is adopted as the final feature selection method and SVM as the classifier. Finally, under 10-fold cross-validation, the accuracies on the six datasets are 0.851, 0.859, 0.801, 0.87, 0.859, and 0.901, respectively. Comparison with previous predictors shows that our model is more effective.

(2) A stacking algorithm is used to construct the classifier for identifying DNA replication initiation sites. First, biological sequence information is transformed into numerical vectors using Ring-function-hydrogen-chemical properties and dinucleotide spatial autocorrelation. Then, the optimal feature subset is obtained using Linear SVC as the feature selection method. Finally, the stacking algorithm is used to build the final classifier, which combines Random Forest, Multinomial NB, Extra Trees, Logistic Regression, and Support Vector Machine. Under 10-fold cross-validation, the accuracies on the two datasets reach 93.85% and 96.70%, respectively. In addition, an independent dataset is used to verify the generalization ability of the prediction model, on which the accuracy is 89.90%. To further illustrate the advantages of the stacking model, it is compared with its base classifiers, and the identification method proposed in this chapter is compared with other prediction methods. Both comparisons show the advantages of the stacking model. These results indicate that our stacking model is a feasible and novel tool for identifying DNA replication initiation sites.
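
Both models in the abstract encode sequences with Ring-function-hydrogen-chemical properties. The abstract does not give the encoding itself, so the following is only a minimal sketch that assumes the three-bit mapping commonly used in the literature (ring structure, functional group, hydrogen-bond strength). The names `RFHC` and `encode_rfhc` are hypothetical and are not taken from the thesis.

```python
# Assumed convention (not confirmed by the thesis): each nucleotide maps to
# three binary properties: (ring structure, functional group, hydrogen bond).
RFHC = {
    "A": (1, 1, 1),  # purine, amino group, weak hydrogen bonding
    "C": (0, 1, 0),  # pyrimidine, amino group, strong hydrogen bonding
    "G": (1, 0, 0),  # purine, keto group, strong hydrogen bonding
    "T": (0, 0, 1),  # pyrimidine, keto group, weak hydrogen bonding
}


def encode_rfhc(sequence):
    """Turn a DNA string into a flat list with three features per base."""
    features = []
    for base in sequence.upper():
        if base not in RFHC:
            # Ambiguous symbols such as 'N' are rejected in this sketch.
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(RFHC[base])
    return features


print(encode_rfhc("ACGT"))  # [1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1]
```

A sequence of length n yields a vector of length 3n under this assumption. Such a vector could then be concatenated with the other feature types named in the abstract before feature selection and classification.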

Keywords/Search Tags:

DNA N4-methylation site, DNA replication initiation site, ensemble learning, Gradient Boosting Decision Tree, Stacking

Related items

1	Study On Gradient Boosting Decision Tree And Its Improvement
2	Research And Application Of Optimizing Survival Analysis Method By Gradient Boosting Tree
3	Research And Application Of Spandex Product Sales Forecast Technology
4	Research On Fog Weather Forecast Based On Machine Learning Method
5	The Research On Prediction Method Of DNA N6-methyladenine Sites And DNase Ⅰ Hypersensitive Sites Based On Ensemble Learning
6	Prediction Of Enhancers And N4 Methylation Sites Based On Ensemble Learning And Deep Learning
7	Recognition Of Translation Initiation Site And Splicing Site In Eukaryote Genome
8	A Quantitative Prediction Model For Albumin And Urea Synthesis Of In Vitro Liver Tissues Based On Gradient Boosting Decision Tree
9	Protein Ubiquitylation And Sumoylation Site Prediction Based On Ensemble And Transfer Learning
10	Research On PiRNA And Promoter Based On Sequence Information