
Research On Word Embedding And Deep Learning Based Replication Origin And Enhancer Prediction

Posted on: 2022-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: F Wu
Full Text: PDF
GTID: 2480306311958379
Subject: Control Science and Engineering
Abstract/Summary:
Origin of Replication (ORI) prediction is one of the important research topics in bioinformatics. DNA replication is the basic process of genetic information transmission, and it influences cell division, cell differentiation, gene expression and other processes. Identifying ORIs by computational methods is therefore of great significance for exploring the cell replication mechanism, the gene expression process, the gene mutation process and the pathogenesis of related diseases. Prediction of enhancers and their types is another active topic in bioinformatics. Enhancers are DNA fragments located in non-coding regions of DNA, and they can be divided into strong enhancers and weak enhancers according to their stimulation intensity. Their main role in biological processes is to increase the transcription rate of DNA sequences in coding regions and improve the efficiency of protein synthesis. Changes in the properties and functions of enhancers may cause many diseases, especially cancer, disorders, and inflammatory gastroenteropathy. In-depth study of enhancers and their functional mechanisms is therefore of great significance for understanding the pathogenesis of related diseases and developing therapeutic approaches. Although biological experimental methods achieve high accuracy in identifying the attributes and functions of test sequences, the emergence of a large number of genomic sequences in the post-genome era has highlighted how time-consuming and costly experimental methods are. It is therefore necessary to develop fast and accurate computational methods to replace biological experiments. Based on deep learning, this thesis establishes two corresponding models, one for eukaryotic DNA ORI prediction and one for the prediction of enhancers and their types. The main research contents are as follows:

(1) On the basis of the Convolutional Neural Network (CNN), a new prediction model for Saccharomyces cerevisiae ORIs is proposed. The continuous 3-gram sequence segmentation method is used to cut each DNA sequence into trinucleotides, and sequence feature vectors are constructed by calculating the contents of the 64 trinucleotides and their corresponding 12 types of physicochemical properties. These vectors are used as the input of a 1D-CNN to identify ORIs. Several experiments verify that recognition performance improves as the number of network layers increases. However, because deep networks are prone to overfitting, a CNN with a single convolutional layer is used as the final model. The robustness of the proposed model is demonstrated by comparison with existing methods.

(2) Combining ideas from natural language with biological sequences, a prediction model for eukaryotic ORIs is constructed based on Word2vec and a CNN with an embedding layer. Each DNA sequence is segmented by the continuous 3-gram word segmentation method to obtain biological words. The distributed representations of the biological words trained by Word2vec are then used to construct the embedding layer of the CNN, which identifies the ORIs. To improve the usage rate of each word in the DNA sequence, the skip 3-gram sequence segmentation method is adopted, yielding four enhanced datasets that are used to perform the prediction task. For each species, the best combination of word segmentation method and network training mode is selected. The comparison results show that this model has satisfactory performance. An independent test dataset is constructed for each species to test the generalization performance of the proposed model. The experiments show that the proposed models achieve good results, indicating that they have both excellent identification ability and strong generalization performance.

(3) A recognition model for enhancers and their types, based on statistics-based sequence segmentation and sequence generation, is proposed. In this model, a Seq-GAN network is first used to generate DNA sequences to expand the data scale of non-enhancers, strong enhancers and weak enhancers. Then, using statistical ideas, each DNA sequence is divided into a reasonable composition of biological words. Word2vec is then used to train the distributed representations of the biological words. Finally, a CNN with an embedding layer is used to perform the identification task. The experimental results show that the artificial sequences generated by Seq-GAN have nucleotide compositions and physicochemical properties similar to those of the natural sequences, and that the prediction model performs better than existing methods in recognizing enhancers and their types. In addition, the excellent performance on the independent test datasets demonstrates the strong generalization performance and robustness of the proposed model.
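
The continuous 3-gram segmentation and the 64 trinucleotide contents mentioned in item (1) can be illustrated with a minimal Python sketch. This is not the thesis code: the function names, the `stride` parameter, the overlapping window (step of 1), and the choice to ignore words containing non-ACGT characters are all assumptions made for illustration, since the abstract does not specify them. The 12 physicochemical properties are not included because their values are not given here.

```python
from collections import Counter
from itertools import product

# All 64 possible trinucleotides, in a fixed lexicographic order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def continuous_3grams(seq, stride=1):
    """Split a DNA sequence into 3-letter "biological words".

    stride=1 (overlapping windows) is an assumption; the abstract does not
    state the window step. stride=3 would give non-overlapping words.
    """
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, stride)]


def trinucleotide_frequencies(seq):
    """Return a 64-dimensional list of trinucleotide relative frequencies.

    Words containing characters other than A, C, G, T are ignored
    (an illustrative choice, not taken from the thesis).
    """
    valid = set(TRINUCLEOTIDES)
    counts = Counter(w for w in continuous_3grams(seq) if w in valid)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[t] / total for t in TRINUCLEOTIDES]


if __name__ == "__main__":
    words = continuous_3grams("ACGTAC")
    print(words)
    vec = trinucleotide_frequencies("ACGTAC")
    print(len(vec), sum(vec))
```

A vector like `vec` could serve as one part of the input to a 1D-CNN, and the word list from `continuous_3grams` is the kind of token sequence that a Word2vec model would be trained on in item (2).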
Keywords/Search Tags: Origins of Replication (ORIs), Enhancers, Word2vec, Convolutional Neural Network (CNN)