Research On Identification Method Of N6-methylation Sites Based On Machine Learning

Posted on: 2022-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhao
Full Text: PDF
GTID: 2480306515456354
Subject: Master of Engineering
Abstract/Summary:
N6-methyladenine (6mA) refers to methylation of the nitrogen atom at the sixth position of adenine. It plays an essential role in maintaining normal transcriptional activity in cells, DNA damage repair, chromatin remodeling, genetic imprinting, embryonic development, and tumorigenesis. Traditional experimental techniques are time-consuming, labor-intensive, and costly, which makes it difficult to identify 6mA sites in high-throughput sequence data. Machine-learning-based computational methods can identify 6mA sites in many sequences at once, saving time and labor. As an effective supplement to biological experiments, they are gradually gaining favor in the biological community. However, existing computational methods for 6mA site identification are often limited in three ways: they build a single classification model, they rely on a single feature type, and they perform poorly on cross-species identification. To address these issues, the main work of this study is as follows:

(1) Evaluation of features and models for 6mA sites. Based on an evaluation of existing DNA sequence features, five features with high discriminative ability for 6mA sites (enhanced nucleic acid composition, electron-ion interaction pseudopotentials of trinucleotides, nucleotide chemical property, Kmer, and ditriKGap) are selected to form a better feature combination. An XGBoost-based feature selection strategy is applied to find a better feature subset. The performance of popular conventional machine learning models and deep learning models is evaluated, and the models with high classification ability for 6mA sites are selected as candidate classifiers.

(2) Construction of a 6mA site identification method based on ensemble learning. Existing methods generally employ a single classification model, which limits their performance. To solve this problem, different ensemble learning strategies are explored on the basis of the single-model evaluations, and a high-performance 6mA site identification model based on ensemble learning, termed Stack6mAPred, is constructed. Stack6mAPred consists of two layers of classifiers. The first layer integrates three popular classifiers: Naive Bayes, support vector machine, and LightGBM. The second layer employs a logistic regression classifier.

(3) Construction of a feature-fusion method for identifying 6mA sites across different species. Most existing methods rely heavily on extensive prior knowledge to design informative handcrafted features, and they have poor capacity to identify 6mA sites across species. To address these issues, this thesis constructs a 6mA site identification model based on feature fusion, termed Fused6mA. In Fused6mA, a trinucleotide coding scheme is used to encode the input DNA sequences. Abstract features are extracted by a convolutional neural network and fused with handcrafted features, and a support vector machine is trained on the fused features to identify 6mA sites. Fused6mA is evaluated on 6mA datasets from four species (rice, strawberry, rose, and Arabidopsis). 10-fold cross-validation results on the rice dataset show that the feature fusion strategy effectively improves prediction performance: the accuracy is 6.2%, 2.9%, and 0.9% higher than that of the existing methods i6mA-Pred, iDNA6mA, and i6mA-DNC, respectively. The model trained on the rice dataset is then used to identify 6mA sites in strawberry, rose, and Arabidopsis. The results show that Fused6mA has stronger transfer learning capability than i6mA-Pred, iDNA6mA, MM-6mAPred, and i6mA-DNC.
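As an illustration of the sequence encodings mentioned in the abstract, the following is a minimal sketch of a trinucleotide coding step and a Kmer frequency feature. The abstract does not specify the exact scheme used in Fused6mA, so everything here is an assumption made for illustration: the overlapping window with stride 1, the lexicographic 64-entry vocabulary, the extra index for windows containing ambiguous bases, and the function names `encode_trinucleotides` and `kmer_frequencies`.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Assumed vocabulary: all 64 trinucleotides in lexicographic order (AAA=0 ... TTT=63).
TRINUCLEOTIDE_INDEX = {
    "".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=3))
}
UNKNOWN_INDEX = len(TRINUCLEOTIDE_INDEX)  # 64, used for windows with bases such as N


def encode_trinucleotides(seq):
    """Map a DNA sequence to integer codes of its overlapping trinucleotides."""
    seq = seq.upper()
    return [
        TRINUCLEOTIDE_INDEX.get(seq[i:i + 3], UNKNOWN_INDEX)
        for i in range(len(seq) - 2)
    ]


def kmer_frequencies(seq, k=2):
    """Return the normalized frequency of every k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous bases are skipped
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]
```

In a pipeline of the kind the abstract describes, the integer codes would typically feed an embedding or one-hot layer ahead of the convolutional network, while the frequency vector would be one of the handcrafted features concatenated with the learned ones before the support vector machine is trained.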
Keywords/Search Tags:N6-methyladenine(6mA), Stacking ensemble learning, Trinucleotide coding, Convolutional Neural Network, Feature fusion