Prediction of Bacterial sRNA Using Machine Learning Methods

Posted on: 2010-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: L G Wang
GTID: 2120360275962376
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Bacterial sRNAs are a class of small non-coding RNAs ranging in length from 40 nt to 500 nt. They are involved in many biological processes, such as post-transcriptional regulation of gene expression, RNA processing, mRNA stability and translation, protein degradation, plasmid replication and bacterial virulence. One main mode of sRNA action is to regulate gene expression through imperfect complementary base pairing between the sRNA and the 5' UTR of its target mRNAs, which plays an important role in the interaction between bacteria and their environments. For instance, the MicF gene encodes a 93-nt sRNA whose main function is to inhibit translation of the ompF mRNA, which encodes an outer membrane protein. In addition, to modulate intracellular iron usage, RyhB, a 90-nt sRNA, down-regulates a set of iron-storage and iron-usage proteins. Thus, identifying sRNAs is important for better understanding bacterial characteristics.
However, sRNA genes are difficult to detect because they do not encode proteins and are relatively small. They are also resistant to single-nucleotide mutations and are difficult to disrupt by random transposon mutagenesis. Therefore, it is very important to develop models for genome-scale prediction of bacterial sRNAs, and the availability of a large number of sequenced bacterial genomes provides a solid basis for developing such models. To date, several models for predicting bacterial sRNAs have been developed. Although these methods are generally designed for particular species, and their reliability is lower than that of models predicting open reading frames, they have played an important role in the identification of bacterial sRNAs. In their recent review, Jonathan and coworkers systematically summarized experimental and bioinformatics approaches for identifying sRNAs, and found that many sRNAs were identified with the aid of bioinformatics tools.
Before 2000, only about a dozen sRNA genes had been found in Escherichia coli, most of which were discovered fortuitously. In the following six years, however, more than 80 sRNAs became known in E. coli through the combination of bioinformatics approaches and experimental verification. According to our recent review, the existing prediction methods are generally classified into three categories: comparative genomics-based methods, transcription unit-based methods and machine learning-based methods. Compared with the comparative genomics-based and transcription unit-based methods, machine learning methods have some merits. For example, machine learning-based prediction models can predict not only bacterial-specific sRNAs, but also sRNAs with either Rho-dependent or Rho-independent terminators. Thus, machine learning-based prediction models provide a general scheme for identifying bacterial sRNA genes.
Here we report two models for genome-scale prediction of sRNAs using machine learning methods, one for E. coli and one for Staphylococcus aureus.
To construct the model for predicting sRNAs in E. coli, we took all 400 sRNAs from proteobacteria as the positive training dataset (POS_TN). The remaining 441 sRNAs, which were not from proteobacteria, were taken as the positive test dataset (POS_TT). Each sequence in POS_TN or POS_TT was first divided into sequence windows of 100 nt with a 50-nt overlap between adjacent windows; a sequence shorter than 100 nt was kept as a single window. Redundancy was then removed using Blastcluster. The negative dataset was obtained from E. coli intergenic regions. To construct the model, each sample was described using 88 sequence features and 2 secondary structure features.
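The windowing step described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name `split_into_windows` is hypothetical, and the handling of the trailing fragment (kept as a final, possibly shorter window) is an assumption, since the abstract does not say how the end of a sequence was treated.

```python
def split_into_windows(seq, window=100, step=50):
    """Split seq into windows of `window` nt whose starts are `step` nt apart.

    With window=100 and step=50, adjacent windows overlap by 50 nt.
    A sequence no longer than `window` is returned as a single window.
    ASSUMPTION: the last window may be shorter than `window`.
    """
    if len(seq) <= window:
        return [seq]
    windows = []
    start = 0
    while True:
        windows.append(seq[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(seq):
            break
        start += step
    return windows
```

Under these assumptions, a 250-nt sequence yields four 100-nt windows, while a 120-nt sequence yields one 100-nt window and one 70-nt window.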
The sequence features comprised percent nucleotide composition (A%, C%, G%, T%, A+T%, G+C%, A-T% and G-C%), all 16 dinucleotide percentages (AA%, AC%, etc.) and all 64 trinucleotide percentages (AAA%, AAC%, etc.). The secondary structure features were the average free energy, MFE(s), defined as the ratio of the secondary structure free energy to the sequence length, and MFEI1, defined as the ratio of MFE(s) to the G+C% content. The secondary structure free energy was calculated using the Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA/). Before model construction, a standard t-test was used to assess the difference of each feature between the positive and negative samples in the training dataset. There were 77 features with P values less than 0.001, and the model was constructed using these 77 features. The resulting model, sRNASVM, achieved a 10-fold cross-validation classification accuracy of 92.45%, outperforming two existing models.
To construct the model for predicting sRNAs in Staphylococcus aureus, we took all bacterial sRNAs as the positive training dataset (POS_TN). Each sequence in POS_TN was first divided into sequence windows of 100 nt with a 50-nt overlap between adjacent windows; a sequence shorter than 100 nt was kept as a single window. Redundancy was then removed using Blastcluster. The negative dataset was obtained from Staphylococcus aureus intergenic regions, and the model was evaluated using a positive test dataset containing 9 new sRNAs from the literature. To construct the model, each sample was described using 22 sequence features and 2 secondary structure features. The sequence features comprised percent nucleotide composition (A%, C%, G%, T%, A+T% and G+C%) and all 16 dinucleotide percentages (AA%, AC%, etc.).
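The sequence features listed above can be sketched as follows for the 88-feature case (8 nucleotide, 16 dinucleotide and 64 trinucleotide features). This is an illustrative sketch only: the function names are hypothetical, k-mers are assumed to be counted in overlapping fashion, and A-T% and G-C% are assumed to mean the differences A% minus T% and G% minus C%, none of which the abstract states explicitly.

```python
from itertools import product

BASES = "ACGT"

def kmer_percent(seq, k):
    """Percentage of each possible k-mer among all overlapping k-mers in seq."""
    total = len(seq) - k + 1
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    for i in range(max(total, 0)):
        kmer = seq[i:i + k]
        if kmer in counts:  # k-mers containing ambiguous bases are skipped
            counts[kmer] += 1
    return {f"{kmer}%": (100.0 * c / total if total > 0 else 0.0)
            for kmer, c in counts.items()}

def sequence_features(seq):
    """Return the 88 composition features as a dict keyed by feature name."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    mono = kmer_percent(seq, 1)
    feats = dict(mono)
    feats["A+T%"] = mono["A%"] + mono["T%"]
    feats["G+C%"] = mono["G%"] + mono["C%"]
    # ASSUMPTION: A-T% and G-C% are differences of the single-base percentages.
    feats["A-T%"] = mono["A%"] - mono["T%"]
    feats["G-C%"] = mono["G%"] - mono["C%"]
    feats.update(kmer_percent(seq, 2))  # 16 dinucleotide percentages
    feats.update(kmer_percent(seq, 3))  # 64 trinucleotide percentages
    return feats
```

Dropping the two difference features and the trinucleotide percentages from this dictionary gives the 22 sequence features described for the Staphylococcus aureus model.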
The secondary structure features were the average free energy, MFE(s), defined as the ratio of the secondary structure free energy to the sequence length, and MFEI1, defined as the ratio of MFE(s) to the G+C% content. The secondary structure free energy was calculated using the Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA/). Finally, we constructed the model using these 24 features. The accuracy of our model on the test set was 55.56%, and the model predicted 304 new sRNAs in the Staphylococcus aureus genome. By combining these predictions with transcription-based prediction methods, we obtained 18 candidate sRNA sequences, which were then used for experimental verification. Eleven candidates were validated by PCR experiments, and 2 candidate sRNAs were validated by Northern blot hybridization and RACE experiments.
In summary, our present work provides support for the experimental identification of bacterial sRNAs.
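The two secondary structure features defined above can be sketched as follows. This is a hedged illustration: the function name `structure_features` is hypothetical, the free energy is taken as an input that is assumed to have been computed beforehand with the Vienna RNA package, and G+C% is assumed to be expressed on a 0 to 100 scale, which the abstract does not specify.

```python
def structure_features(seq, free_energy):
    """Return (MFE(s), MFEI1) for a sequence and its folding free energy.

    MFE(s) = free energy / sequence length
    MFEI1  = MFE(s) / G+C%
    ASSUMPTION: free_energy (kcal/mol) was computed externally, and
    G+C% is a percentage between 0 and 100.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    mfe_s = free_energy / len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    # MFEI1 is undefined when the sequence contains no G or C.
    mfei1 = mfe_s / gc_percent if gc_percent > 0 else float("nan")
    return mfe_s, mfei1
```

For example, an 8-nt sequence with 50% G+C and a free energy of -4.0 kcal/mol gives MFE(s) = -0.5 and MFEI1 = -0.01 under these assumptions.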
Keywords/Search Tags: sRNA, prediction, machine learning, bacteria