| DNA N4-methylation and DNA replication initiation are common and important epigenetic mechanisms. N4-methylcytosine (4mC) is an important methylation modification that is widespread in prokaryotes. It plays a crucial role in regulating DNA replication, participating in the restriction-modification system, and protecting host DNA from damage. Accurate identification of 4mC sites is therefore highly significant for understanding biological functions and treating genetic diseases. DNA replication is one of the most important life activities in cells. Although replication mechanisms differ between species, they share some common features, such as DNA replication initiation sites. Establishing a powerful model to predict DNA replication initiation sites is therefore of great significance for further understanding gene expression and regulation during cell division. Although researchers have constructed many prediction models to identify DNA N4-methylation sites and replication initiation sites, their reported prediction performance is not ideal. Given the rapid development of ensemble learning and its successful application in various fields, this work uses ensemble learning methods to build the identification models. The primary research results of this paper are as follows: (1) Gradient Boosting Decision Tree (GBDT) is used as the feature selection method to construct a prediction model for DNA N4-methylation sites. First, biological sequences are transformed into numerical vectors by multi-source feature representation methods, namely features based on sequence information, Ring-function-hydrogen-chemical properties, and DNA physicochemical properties. Subsequently, for feature selection and classification, we experiment with ensemble learning algorithms and other machine learning algorithms, respectively. Based on extensive experiments, the ensemble learning algorithm GBDT is adopted as the final feature selection method and SVM as the classifier to construct the prediction model for DNA N4-methylation sites. Finally, under 10-fold cross-validation, the accuracies on the six datasets are 0.851, 0.859, 0.801, 0.87, 0.859, and 0.901, respectively. Comparison with previous predictors shows that our model is more effective. (2) A stacking algorithm is used to construct the classifier for identifying DNA replication initiation sites. First, the biological sequence information is transformed into numerical vectors using Ring-function-hydrogen-chemical properties and dinucleotide spatial autocorrelation. Then, the optimal feature subset is obtained using Linear SVC as the feature selection method. Finally, the stacking algorithm is used to build the final classifier, which combines Random Forest, Multinomial Naive Bayes, Extra Trees, Logistic Regression, and Support Vector Machine. Under 10-fold cross-validation, the accuracies on the two datasets reach 93.85% and 96.70%, respectively. In addition, an independent dataset is used to verify the generalization ability of the prediction model, on which its accuracy is 89.90%. To further illustrate the advantages of the stacking model, it is compared with its base classifiers, and the identification method proposed in this chapter is also compared with other prediction methods. Both comparisons show the advantages of the stacking model. Taken together, these results show that our stacking model is a feasible and novel tool for identifying DNA replication initiation sites. |
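
The Ring-function-hydrogen-chemical representation mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact mapping, so the binary codes below are an assumption based on the commonly used convention: ring structure (purine = 1), functional group (amino = 1), and hydrogen bonding (weak = 1). The names `RFH_CODES` and `encode_rfh` are hypothetical and chosen only for this illustration.

```python
# Assumed nucleotide chemical-property codes (not stated in the abstract):
# (ring structure: purine=1, functional group: amino=1, hydrogen bond: weak=1)
RFH_CODES = {
    "A": (1, 1, 1),  # purine, amino, weak hydrogen bonding
    "C": (0, 1, 0),  # pyrimidine, amino, strong hydrogen bonding
    "G": (1, 0, 0),  # purine, keto, strong hydrogen bonding
    "T": (0, 0, 1),  # pyrimidine, keto, weak hydrogen bonding
}


def encode_rfh(sequence):
    """Encode a DNA sequence as a flat list with three binary values per base."""
    features = []
    for base in sequence.upper():
        if base not in RFH_CODES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        features.extend(RFH_CODES[base])
    return features


if __name__ == "__main__":
    # A sequence of length n yields a vector of length 3 * n.
    print(encode_rfh("ACGT"))
```

Under this assumed mapping, a sequence of length n becomes a vector of length 3n. Such a vector could then be concatenated with the other feature groups before feature selection.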