Prediction Of Transcription Terminator And Origin Of Replication Based On Sequence Informatics

Posted on: 2022-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: W R Wang
Full Text: PDF
GTID: 2480306554971219
Subject: Master of Engineering
Abstract/Summary:
Human reproduction and growth depend on cell reproduction, gene replication, and transcription, and terminators and origins of replication are integral to regulating these processes. Solving the related sequence prediction problems can therefore not only improve genome annotation but also help address clinical diseases linked to genotype. However, traditional biological experiments are time-consuming and their accuracy is not stable. Therefore, based on the information characteristics of biological sequences and combined with machine learning classification models, we carried out research on DNA transcription termination sequences and replication origin sequences of different species. The main research contents are as follows:

(1) A terminator is a DNA sequence that gives RNA polymerase the transcriptional termination signal. Identifying terminators has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are lacking and urgently needed. In this study, we collected data from Escherichia coli and Bacillus subtilis and employed five feature extraction methods (PseKNC-I, PseKNC-II, K-pwm, Base-content, Nucleotidepro) to formulate raw samples. A two-step method was used for feature selection. In training on the optimized features, we compared five single models as well as 16 ensemble models. In the end, we proposed a terminator prediction method, "iterb-PPse", which incorporates 47 nucleotide properties into PseKNC-I and PseKNC-II and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. The accuracy of our method on the benchmark dataset reached 99.88% after 100 rounds of five-fold cross-validation.

(2) The origin of replication is the starting site of DNA replication and is a critical part of the inheritance of genetic information between parents and offspring. Accurate identification of origins of replication has great application value in the diagnosis and treatment of diseases related to genetic information errors. We therefore studied the identification of origins of replication in a variety of eukaryotes and proposed a dedicated prediction method for each species. We collected data from 7 species: H. sapiens, M. musculus, D. melanogaster, A. thaliana, K. lactis, P. pastoris, and S. pombe. In addition to the commonly used sequence feature extraction methods PseKNC-? and Base-content, we also designed a feature extraction method based on TF-IDF. A two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, the multi-layer perceptron was employed as the classification algorithm. After 100 rounds of five-fold cross-validation, the prediction accuracies on the benchmark sets of the above-mentioned seven species reach 92.60%, 90.86%, 91.22%, 96.15%, 94.20%, 99.86%, respectively.
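
The abstract does not say how the TF-IDF-based features are computed. As a minimal sketch, assuming each DNA sequence is treated as a "document" of overlapping k-mers (an assumption, not the thesis's confirmed design), base-content and TF-IDF features might be derived as below. The function names, the choice k=3, and the smoothed IDF formula are all illustrative and do not come from the source.

```python
import math
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Return the overlapping k-mers of a DNA sequence (upper-cased)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def base_content(seq):
    """Fraction of A, C, G, T in the sequence, in that order."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    return [seq.count(b) / n for b in "ACGT"]


def tfidf_features(seqs, k=3):
    """Hypothetical k-mer TF-IDF: one row per sequence, one column per k-mer.

    tf  = count of the k-mer in the sequence / number of k-mers in the sequence
    idf = ln((1 + N) / (1 + df)) + 1   (smoothed; an illustrative choice)
    """
    # Fixed vocabulary of all 4**k k-mers over the alphabet ACGT.
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    docs = [Counter(kmers(s, k)) for s in seqs]
    n_docs = len(docs)
    # Document frequency: number of sequences containing each k-mer.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        total = sum(d.values())
        rows.append([d[w] / total * idf[w] if total else 0.0 for w in vocab])
    return vocab, rows
```

Under these assumptions, the rows returned by `tfidf_features` could be concatenated with `base_content` values and passed to a classifier such as the Extreme Gradient Boosting or multi-layer perceptron models named in the abstract; the feature selection and cross-validation steps are not sketched here.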
Keywords/Search Tags: Transcription terminator, Origin of replication, PseKNC, TF-IDF, XGBoost, MLP, STREME