
A Long-Read-Based De Novo Method to Find Repetitive Elements in the Genome

Posted on: 2019-11-07
Degree: Master
Type: Thesis
Country: China
Candidate: R Guo
Full Text: PDF
GTID: 2370330566961592
Subject: Computer Science and Technology
Abstract/Summary:
Repetitive elements (repeats) are identical sequences that occur more than once in the genome. Identifying repeats helps to analyze species evolution, resolve ambiguity in sequence mapping, and reduce assembly errors. Compared with the RepBase repeat library, the repeats generated by methods that use short reads are short and incomplete. Long reads carry more information than short reads and span longer repeats, so they have the potential to yield a better repeat library. This dissertation is dedicated to identifying repeats based on long reads. The main contributions are as follows:
1) RepLong, a repeat identification method based on long reads, is proposed. The MHAP method is first used to compute the overlaps between long reads. A network of read overlaps is then constructed from the pairwise alignments of the reads, in which each vertex represents a read and each edge represents a substantial overlap between the two corresponding reads. Network communities whose intra-community connectivity is greater than their inter-community connectivity are extracted by network modularity optimization. Finally, representative reads from each community are extracted to form the repeat library.
2) RepPeak improves on RepLong by resolving its community resolution issue, and its results are more interpretable. A reference genome is first assembled from the long reads, or an existing reference genome is used. The long reads are then mapped back to the reference genome. The read depth at every position of the reference genome is calculated, and the positions where the read depth changes sharply are extracted. After these positions are merged and false read depth changes are removed, the reads in the resulting position ranges are extracted as repeats.
Comparison studies on Drosophila melanogaster and human long-read sequencing data against genome-based and short-read methods demonstrate the effectiveness of RepLong and RepPeak in identifying long repeats. The identified repeats are longer and more complete than those produced by the compared methods. The new methods help to solve the fragmentation issue of repeats identified by short-read methods, and the repeats they report contain more complete information. They take advantage of long reads to identify longer repeats without assembling the input reads.
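The read-depth idea behind RepPeak can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the `min_jump` threshold, and the rule that pairs each sharp rise with the next sharp drop are all assumptions made for illustration, and the merging and false-positive removal steps mentioned in the abstract are omitted. Alignments are assumed to be given as half-open `[start, end)` intervals on the reference.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def read_depth(genome_length: int, alignments: List[Interval]) -> List[int]:
    """Per-position read depth, computed with a difference array."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering at `start`
        diff[end] -= 1     # and stops covering at `end`
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def sharp_changes(depth: List[int], min_jump: int) -> List[Tuple[int, int]]:
    """Positions where depth changes by at least `min_jump` (hypothetical
    threshold) relative to the previous position, with the signed jump."""
    changes = []
    for i in range(1, len(depth)):
        jump = depth[i] - depth[i - 1]
        if abs(jump) >= min_jump:
            changes.append((i, jump))
    return changes


def candidate_regions(changes: List[Tuple[int, int]]) -> List[Interval]:
    """Pair each sharp rise with the next sharp drop (assumed pairing rule)
    to get half-open candidate repeat regions."""
    regions, start = [], None
    for pos, jump in changes:
        if jump > 0 and start is None:
            start = pos
        elif jump < 0 and start is not None:
            regions.append((start, pos))
            start = None
    return regions
```

On a toy reference of length 30 with two reads spanning the whole reference and four extra reads piled onto positions 10 to 19, the depth rises from 2 to 6 at position 10 and falls back at position 20, so the sketch reports the single candidate region `(10, 20)`. Real long-read alignments have staggered ends, so depth changes are more gradual and would need smoothing or windowing before thresholding.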
Keywords/Search Tags: Repetitive element identification, Long-read sequencing, Network community detection, RepLong, RepPeak