Research Of Bioinformatics On Transcription Regulation

Posted on: 2021-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: M Zhang
GTID: 2370330602489026
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Transcriptional regulation is a key step in gene expression and is essential to the normal life activities of organisms. It is influenced by promoters, histone post-translational modifications and other factors; when these factors are absent or mutated, serious human diseases can result. Experimental methods such as high-resolution mass spectrometry are time-consuming and laborious. This thesis therefore aims to develop efficient, high-precision prediction models for promoters and histone post-translational modification sites, based on machine learning classification algorithms and with a focus on handling data imbalance, so as to reduce the experimental workload. The main results are as follows:

(1) To address the low accuracy of existing models in recognizing promoters and their specific types, this thesis proposes a multi-layer computational approach called MULTiPly. It extracts local information from the sample sequences using k-tuple nucleotide composition and dinucleotide autocorrelation composition, and global information using the Bi-profile Bayes and k-nearest neighbor feature encodings. The F-score and incremental feature selection methods were then applied to construct the optimal feature combination and further improve classification accuracy. Moreover, to deal with the extreme imbalance in the number of promoters of different types, five sub-classifiers were developed for the second prediction task to determine the promoter type one by one. Comprehensive benchmarking experiments using 5-fold cross-validation, the jackknife test and an independent test consistently showed the effectiveness of MULTiPly, especially for distinguishing specific types of promoters.

(2) Lysine formylation is a reversible protein post-translational modification that is involved in a myriad of biological processes. This thesis first introduced and integrated the most distant undersampling and safe-level synthetic minority oversampling techniques to establish a balanced training dataset. Four effective feature extraction methods, namely Bi-profile Bayes, k-nearest neighbor, amino acid physicochemical properties, and composition, transition and distribution, were employed to encode the sequence surrounding potential formylation sites. Finally, the ensemble model Formator was built. Performance comparisons on the jackknife test and an independent test indicate that Formator significantly outperforms LFPred, the only existing computational tool.

(3) This thesis summarized the computational tools described in more than 40 important publications on prokaryotic promoters since 2000 and studied the development trend of promoter research in bioinformatics. The tools were grouped by computational method (scoring function-based, machine learning-based and deep learning-based), compared in terms of their features, algorithms, performance evaluation strategies, software usability and species studied, and ordered by publication year. Extensive independent tests were then performed to assess the robustness and scalability of the reviewed prokaryotic promoter prediction methods that provide an online web server or stand-alone program, using carefully prepared independent test datasets from the RegulonDB and DBTBS databases.
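
To make two of the steps named in result (1) concrete, the following is a minimal sketch of k-tuple nucleotide composition and F-score feature ranking. It is not the MULTiPly implementation: the function names (kmer_composition, f_score, rank_features) are hypothetical, and the F-score formula used is the commonly cited two-class definition, which is assumed here because the abstract does not state it. Incremental feature selection would then add features in the ranked order and keep the subset with the best cross-validated accuracy; that step is omitted.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Normalized frequency of every k-tuple, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def f_score(pos, neg):
    """Two-class F-score of one feature (assumed definition).

    pos and neg hold the feature's values in the positive and negative
    samples; each list needs at least two values.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: how far each class mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the sample variances within each class.
    within = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
              + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_vectors, neg_vectors):
    """Feature indices sorted from most to least discriminative."""
    n_features = len(pos_vectors[0])
    scores = [
        f_score([v[j] for v in pos_vectors], [v[j] for v in neg_vectors])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

A higher F-score means the feature separates the two classes better relative to its spread within each class, which is why it is a natural ordering for incremental feature selection.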
Keywords/Search Tags:transcription regulation, promoter, lysine formylation site, resampling methods, machine learning