
Improving genome assembly

Posted on: 2006-03-08
Degree: Ph.D
Type: Dissertation
University: University of Maryland, College Park
Candidate: Ustun, Cevat
Full Text: PDF
GTID: 1452390008470077
Subject: Biology
Abstract/Summary:
We present a reliable, easy-to-implement algorithm that generates a set of highly reliable overlaps by identifying repeat k-mers. Our method is coverage independent. Whereas reads have traditionally been trimmed to an expected error rate of 2%, we find that our error correction lets us extend the usable sequence in reads by trimming at 16% instead. We use a version of the Phrap assembly program, called PhrapUMD, that uses only overlaps computed by the UMD overlapper. We integrate the UMD algorithms with Baylor's ATLAS assembler, applied to Rattus norvegicus. Starting with the same data as the Nov. 2002 ATLAS assembly, we compare our results to 4.5 Mbp of rat sequence in 21 BACs that have been finished. We find that after extension and error correction, (i) the reads are 30% longer than reads trimmed to 2%; (ii) the average error rate across the extended reads is about 3 in 10,000 bases, with 88% of the extended reads matching finished sequence exactly across their entire length; and (iii) PhrapUMD with these reads and our reliable overlaps produces a draft assembly of the rat that has no misassemblies and increases the coverage of finished sequence from 92.2% to 95.7%, while simultaneously reducing the base error rate for bases of quality 20 or higher from 1.50 to 0.87 errors per 10,000.
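The abstract does not spell out how repeat k-mers are identified. As a rough illustration of the general idea only, the sketch below counts every k-mer across a set of reads and flags those seen more often than a cutoff. The function names, the toy reads, and the fixed `threshold` are all assumptions made for this example; in particular, a fixed count cutoff depends on coverage, so this is not the coverage-independent method the dissertation describes.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repeat_kmers(counts, threshold):
    """Return k-mers seen more than `threshold` times.

    `threshold` is a hypothetical fixed cutoff chosen for illustration.
    """
    return {kmer for kmer, n in counts.items() if n > threshold}


# Toy reads, invented for this example.
reads = ["ACGTACGTAC", "CGTACGTACG", "TTTTACGTAA"]
counts = count_kmers(reads, k=4)
repeats = repeat_kmers(counts, threshold=3)
```

An overlapper built on this idea might then avoid seeding overlaps from the flagged k-mers, since matches inside repeats are the ones most likely to join reads from different genomic locations.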
Keywords/Search Tags: Error, Assembly