Some statistical issues in genomics: Shotgun DNA sequence assembly and cDNA expression data

Posted on: 2003-09-14
Degree: Ph.D
Type: Thesis
University: University of Southern California
Candidate: Li, Xiaoman
Full Text: PDF
GTID: 2462390011487666
Subject: Biology
Abstract/Summary:
Many recent genome sequencing projects have used double-end clones. The first part of this thesis predicts genome coverage in such projects. The traditional Lander-Waterman formulas describe the statistical properties only of assembly projects that use clones without "mate-pairs", so they need to be extended to cover double-end genome sequencing.

We improve on previous results and calculate the average number and length of scaffolds, islands, gaps, etc. In addition, we estimate the distribution of the gap size between adjacent islands. Instead of assuming fixed-length clones and fixed-length ends, we allow general statistical distributions of these quantities.

Another topic in the first part is estimating gap sizes after assembly has been done. This is needed especially when no high-resolution map is available. We have developed several methods and compared them with current methods on real data. This work provides some guidance for estimating gap sizes after assembly.

In the second part, we predict the repeat structure of a genome from reads, without assembly. Repeats in the human genome cause every assembler to make mistakes. To provide some clues before assembly, we estimate the repeat structure of a genome from the l-tuple information contained in the reads. Our results agree very well with both simulations and experiments. In addition, the method provides a consensus estimate for some repeat families in the genome, which will help the assembly process. It can also be used to build a better repeat masker. Furthermore, it gives an estimate of the genome size.

The third part focuses on the statistical analysis of cDNA microarray data. It is important to determine statistically which genes are significantly up- or down-regulated. We have modified Kerr and Churchill's method to approach this problem. Our method enables scientists to be more confident about the conclusions drawn from data. It has been incorporated into commercial software and patented.
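
For orientation, the traditional Lander-Waterman formulas that the first part sets out to extend can be sketched as follows. This is only the classical single-end model (fixed-length reads placed uniformly at random, no mate-pairs), not the double-end extension developed in the thesis. The function name, field names, and example parameters are illustrative assumptions and do not come from the thesis.

```python
import math
from dataclasses import dataclass


@dataclass
class LanderWatermanStats:
    coverage: float                # c = N * L / G
    expected_islands: float        # N * exp(-c * sigma)
    expected_island_length: float  # L * ((exp(c * sigma) - 1) / c + (1 - sigma))
    fraction_covered: float        # 1 - exp(-c), ignoring the overlap threshold


def lander_waterman(genome_length, read_length, num_reads, min_overlap):
    """Classical Lander-Waterman expectations for single-end shotgun reads.

    Assumes fixed-length reads placed uniformly at random on the genome.
    All parameter names here are hypothetical, chosen for illustration.
    """
    if min(genome_length, read_length, num_reads) <= 0:
        raise ValueError("genome_length, read_length and num_reads must be positive")
    if not 0 <= min_overlap < read_length:
        raise ValueError("min_overlap must satisfy 0 <= min_overlap < read_length")

    c = num_reads * read_length / genome_length  # redundancy of coverage
    sigma = 1.0 - min_overlap / read_length      # 1 - theta, theta = T / L

    islands = num_reads * math.exp(-c * sigma)
    # expm1(x) = exp(x) - 1, computed accurately for small x
    island_length = read_length * (math.expm1(c * sigma) / c + (1.0 - sigma))
    covered = 1.0 - math.exp(-c)

    return LanderWatermanStats(c, islands, island_length, covered)


if __name__ == "__main__":
    # Illustrative numbers only: 1 Mb genome, 500 bp reads, 20,000 reads,
    # 50 bp minimum detectable overlap.
    print(lander_waterman(1_000_000, 500, 20_000, 50))
```

Because this model treats each read independently, it has no notion of scaffolds or of gaps spanned by a clone, which is the limitation the first part addresses.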
Keywords/Search Tags:Assembly, Data, Genome, Statistical, Part