A bioinformatic analysis of simple repeats and small proteins in prokaryotic genomes

Posted on: 2010-02-08
Degree: Ph.D.
Type: Dissertation
University: University of California, Santa Cruz
Candidate: Samayoa, Josue A
Full Text: PDF
GTID: 1443390002987926
Subject: Biology
Abstract/Summary:
Simple sequence repeats have been found to regulate the expression of genes involved in virulence, immune evasion, and other host-interaction functional categories. Furthermore, these motifs have been found to be over-represented in the genomes of several organisms, including Neisseria meningitidis. I investigated three main aspects of simple repeat motifs in the complete genome of Vibrio cholerae El Tor. First, are there over-represented simple repeats, and what genes are associated with them? Second, are over-represented simple repeats more prone to length variation? Third, I searched for variable-length simple sequence repeats and inverted repeats to identify novel phase-variable genes in this organism.

My findings indicated that V. cholerae El Tor does not have the same over-representation of simple repeat motifs as N. meningitidis. Furthermore, I found that all simple repeat motifs, regardless of over-representation, are highly prone to length variation. My search for variable-length simple repeats and inverted repeats revealed several putative phase-variable genes. Initial attempts at experimental validation were unsuccessful.

Accurate prediction of genes encoding small proteins remains an open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein-coding sequence. These methods often fail for small proteins, however, either because few small protein-coding genes have been experimentally verified or because statistics computed on short sequences have limited significance.

I have developed a method that incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach to predict whether a given open reading frame (ORF) encodes a short protein. I applied this method to the complete genome of E. coli strain K12 and examined how well it performed on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11,467 possible ORFs, 4 of the top 10 and 24 of the top 100 predictions belonged to the set of 60 experimentally verified short proteins.
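To put the reported ranking results in context, the short Python sketch below computes precision at each cutoff and compares it with the rate expected if the 60 verified proteins were scattered at random among the 11,467 ORFs. The counts come from the abstract; the function names and the random-ranking baseline are illustrative assumptions, not part of the dissertation's method.

```python
# Counts taken from the abstract.
TOTAL_ORFS = 11467      # candidate ORFs in E. coli K12
VERIFIED = 60           # experimentally verified small proteins
TOP_K_HITS = {10: 4, 100: 24}  # cutoff k -> verified proteins in the top k


def precision_at_k(hits: int, k: int) -> float:
    """Fraction of the top-k predictions that are verified proteins."""
    return hits / k


def recall_at_k(hits: int, positives: int) -> float:
    """Fraction of all verified proteins recovered in the top k."""
    return hits / positives


def fold_enrichment(hits: int, k: int, positives: int, total: int) -> float:
    """Precision at k divided by the precision of a random ranking.

    Assumption: a random ranking places verified proteins uniformly,
    so its expected precision is positives / total.
    """
    baseline = positives / total
    return precision_at_k(hits, k) / baseline


if __name__ == "__main__":
    for k, hits in TOP_K_HITS.items():
        print(
            f"top {k}: precision={precision_at_k(hits, k):.2%}, "
            f"recall={recall_at_k(hits, VERIFIED):.2%}, "
            f"enrichment={fold_enrichment(hits, k, VERIFIED, TOTAL_ORFS):.1f}x"
        )
```

Under these assumptions, precision is 40% in the top 10 and 24% in the top 100, against a random-ranking baseline of roughly 0.5% (60 of 11,467). The top 100 predictions recover 24 of the 60 verified proteins, a recall of 40%.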
Keywords/Search Tags:Simple, Repeats, Proteins, Experimentally verified, Genes, Sequence, Found