Prediction Of DNA N4-methylcytosine Modification Sites Based On Deep Learning

Posted on: 2022-10-09
Degree: Master
Type: Thesis
Country: China
Candidate: D Cui
GTID: 2480306548961049
Subject: Electronics and Communications Engineering
Abstract/Summary:
DNA N4-methylcytosine (4mC) is an important epigenetic modification in prokaryotic DNA: it plays a vital role in regulating DNA replication, protecting host DNA from degradation, and other biological processes. Accurate identification of 4mC sites therefore supports in-depth study of its biological functions and mechanisms. However, experimental identification of 4mC sites is time-consuming and expensive, and given the rapid accumulation of gene sequences, an effective computational method is urgently needed. In this study, we combine a CNN, an attention mechanism, and Bi-LSTM to address this problem, and construct a multi-layer deep learning prediction system to identify DNA N4-methylcytosine modifications. The main work is as follows:

(1) For data input, we describe DNA sequences by the chemical properties of nucleotides, which have been widely used for identifying DNA modification sites. Each nucleotide is represented as a one-dimensional vector, and 3-mer ID features are then constructed from these chemical property features to turn the sequences into integer vectors. Experiments show that this encoding better expresses the relevant information in the DNA sequence.

(2) A deep learning network framework based on Bi-LSTM and attention is constructed to identify DNA N4-methylcytosine modification sites. The framework takes the features above as input. First, multiple convolution modules automatically learn informative features from the 4mC site sequences. To better capture the importance of the sequence context, we further introduce an attention mechanism into the model. The attention mechanism takes the feature vector produced by the convolution and pooling operations as input and computes a score indicating whether the network should attend to the sequence feature at each position, which helps improve our prediction results. To prevent overfitting, we use early stopping to obtain better generalization performance.

(3) In the experiments, the framework is first tested on the six data sets (C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, G. pickeringii) constructed by Chen et al. to verify its effectiveness; below, we refer to these as the 4mC small data sets. Second, we selected three large-sample data sets constructed by Zeng et al. for cross-validation; below, we refer to these as the 4mC large data sets. Across many experiments, the 4mC site prediction results obtained in this thesis on both groups of data sets are better than those of existing algorithms. The framework is then tested on the DNA N6-methyladenine data set (Rice), and the results show that the Attention4mC framework also performs well on the 6mA-Rice data set. Finally, the thesis summarizes the work on 4mC sites and outlines future research directions.
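The input encoding described in point (1) can be illustrated with a minimal Python sketch. The abstract does not give the exact mapping, so everything here is an assumption: the chemical property triples follow the coding commonly used in the literature (ring structure, functional group, hydrogen bonding), and the 3-mer IDs are simply vocabulary indices over the 64 possible 3-mers, with 0 reserved for unknown symbols. The names `NCP`, `KMER_ID`, `ncp_encode`, and `three_mer_ids` are hypothetical, and the thesis may derive its IDs differently.

```python
from itertools import product

# Assumed chemical property coding per nucleotide:
# (ring structure, functional group, hydrogen bonding).
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}

# Hypothetical vocabulary: the 64 possible 3-mers get IDs 1..64 in
# lexicographic order over "ACGT"; 0 is reserved for unknown 3-mers.
KMER_ID = {"".join(k): i + 1 for i, k in enumerate(product("ACGT", repeat=3))}


def ncp_encode(seq):
    """Map each nucleotide to its assumed chemical property triple."""
    return [NCP[base] for base in seq.upper()]


def three_mer_ids(seq):
    """Turn a DNA sequence into integer IDs of its overlapping 3-mers."""
    seq = seq.upper()
    return [KMER_ID.get(seq[i:i + 3], 0) for i in range(len(seq) - 2)]


if __name__ == "__main__":
    example = "ACGTTGCA"  # made-up sequence, not taken from any data set
    print(ncp_encode(example)[:2])
    print(three_mer_ids(example))
```

A sequence of length n yields n - 2 integer IDs, which is the kind of integer vector an embedding layer followed by convolution modules would typically consume.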
Keywords/Search Tags: 4mC sites, Attention mechanism, Bi-directional LSTM, 3-mer ID feature