DNA Elements And Recombination Hotspots Identification Based On Sequence Information

Posted on: 2018-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: R Long
Full Text: PDF
GTID: 2310330533969247
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, a large amount of DNA sequence data has been generated. However, because biological experiments are time-consuming, structural and functional data for these sequences are accumulating slowly, and new methods for analyzing the data are urgently needed. Using sequence information of DNA (nucleotide composition, physicochemical properties, and position information) together with machine learning, this thesis investigates DNA sequences involved in gene expression regulation (DNase I hypersensitive sites, enhancers, and promoters) and gene replication (recombination hotspots).

DNase I hypersensitive sites (DHSs) are regions of chromatin that are sensitive to cleavage by the DNase I enzyme. In this study, we used three feature extraction methods (k-mers, reverse complementary k-mers, and pseudo dinucleotide composition) to extract information from DNA sequences, and built three basic classifiers with random forest. Integrating these classifiers through ensemble learning strategies (weighted summation and voting) can improve the accuracy of DHS identification.

DHSs are related to many gene regulatory elements, and we further studied two important ones: enhancers and promoters. For enhancers, we extracted sequence information using the pseudo k nucleotide composition method. Combined with a support vector machine, a two-layer prediction model was proposed that further predicts whether an enhancer is strong or weak, and this model achieved good performance.

Enhancers increase the transcription rate of genes, while promoters control the timing and level of gene expression. For promoters, we used sequence position information and pseudo k nucleotide composition to extract features, and built the classifier with random forest and an ensemble learning strategy. Experimental results on a benchmark dataset and an independent dataset showed that our method achieved higher accuracy, and feature analysis indicated that the feature extraction method is effective.

This study also explored recombination hotspots, which influence evolution during DNA replication. Pseudo k nucleotide composition and auto-cross covariance were used as feature extraction methods. Combined with the support vector machine, different parameters were used to construct different basic classifiers. Clustering the basic classifiers with the affinity propagation algorithm both preserved the performance of the basic classifiers and ensured diversity among them. Experimental results on a benchmark dataset showed that our method achieved good performance.
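
As an illustration of two of the feature extraction methods named in the abstract (k-mers and reverse complementary k-mers), the following is a minimal sketch in Python. It is not the thesis's implementation: the function names, the normalization to frequencies, the skipping of windows that contain characters other than A, C, G, T, and the choice of the lexicographically smaller string as the representative of each reverse-complement pair are all assumptions made for this example.

```python
from itertools import product

# Translation table for complementing a DNA string (assumed alphabet: A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seq, k):
    """Map each of the 4**k possible k-mers to its frequency in seq.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption of this sketch, not something stated in the abstract).
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def revcomp_kmer_frequencies(seq, k):
    """Merge each k-mer with its reverse complement into one feature.

    The lexicographically smaller of the two strings is used as the key,
    which is a convention chosen for this example.
    """
    merged = {}
    for kmer, freq in kmer_frequencies(seq, k).items():
        key = min(kmer, reverse_complement(kmer))
        merged[key] = merged.get(key, 0.0) + freq
    return merged
```

For k = 2 the first function yields 16 features and the second yields 10, because the four palindromic dinucleotides (AT, TA, CG, GC) are their own reverse complements. Feature vectors of this kind could then be passed to a classifier such as a random forest.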
Keywords/Search Tags:DNase I hypersensitive sites, enhancer, promoter, recombination hotspots, ensemble learning