A Long Read Based De Novo Method To Find Repetitive Element In The Genome

Posted on:2019-11-07

Degree:Master

Type:Thesis

Country:China

Candidate:R Guo

Full Text:PDF

GTID:2370330566961592

Subject:Computer Science and Technology

Abstract/Summary:

Repetitive elements (repeats) are identical sequences that occur more than once in a genome. Identifying repeats helps to analyze species evolution, resolve ambiguity in sequence mapping, and reduce assembly errors. Compared with the RepBase repeat library, the repeats generated by short-read methods are short and incomplete. Long reads carry more information than short reads and span longer repeats, so they have the potential to yield a better repeat library. This dissertation is dedicated to identifying repeats based on long reads. The main contributions are as follows:

1) RepLong, a repeat identification method based on long reads, is proposed. The MHAP method is first used to compute the overlaps between long reads. A network of read overlaps is then constructed from the pairwise alignments of the reads, where each vertex represents a read and each edge represents a substantial overlap between the two corresponding reads. Network communities whose intra-community connectivity is greater than their inter-community connectivity are extracted by network modularity optimization. Finally, representative reads in each community are extracted to form the repeat library.

2) RepPeak improves on RepLong by solving RepLong's community resolution issue, and its results are more interpretable than those of RepLong. First, a reference genome is assembled from the long reads, or an existing reference genome is used. The long reads are then mapped back to the reference genome. The read depth at every position of the reference genome is calculated, and the positions where the read depth changes sharply are extracted. After merging these positions and removing false read depth changes, the reads in the resulting position ranges are extracted as repeats.

Comparison studies against genome-based and short-read methods on Drosophila melanogaster and human long-read sequencing data demonstrate the efficiency of RepLong and RepPeak in identifying long repeats. The identified repeats are longer and more complete than those produced by the compared methods. The new methods help to solve the fragmentation issue of repeats identified by short-read methods and retain more complete information. They take advantage of long reads to identify longer repeats without assembling the input reads.
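The RepLong steps described in the abstract (overlap network, then modularity optimization) can be illustrated with a small sketch. This is not the thesis implementation: the read names, the edge list, and the naive greedy merging strategy are all assumptions made for illustration, and the overlaps are taken as given rather than computed with MHAP. The abstract does not say how representative reads are chosen, so that step is left out.

```python
from itertools import combinations

def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    label = {node: i for i, comm in enumerate(communities) for node in comm}
    intra = [0] * len(communities)   # edges with both ends in community i
    degree = [0] * len(communities)  # sum of vertex degrees in community i
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            intra[label[u]] += 1
    return sum(intra[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))

def greedy_communities(edges):
    """Naive greedy agglomeration: repeatedly apply the merge that raises Q most.

    Assumption: a simple stand-in for the modularity optimization named in the
    abstract; it recomputes Q for every candidate merge, so it only suits tiny graphs.
    """
    nodes = sorted({n for edge in edges for n in edge})
    communities = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, merged)
            if q > best_q + 1e-12:  # tolerance avoids flapping on float ties
                best_q, best_merge = q, merged
        if best_merge is None:      # no merge improves Q any further
            break
        communities = best_merge
    return communities, best_q

# Hypothetical overlap network: each vertex is a read, each edge a substantial overlap.
# Two densely overlapping groups of reads are joined by a single overlap (r3, r4).
overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"),
            ("r4", "r5"), ("r5", "r6"), ("r4", "r6"),
            ("r3", "r4")]
communities, q = greedy_communities(overlaps)
```

On this toy network the two triangles end up as separate communities, which matches the abstract's description of groups whose internal connectivity exceeds their external connectivity.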
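The read depth step that the abstract attributes to RepPeak can be sketched in the same spirit. The interval format, the `min_jump` threshold, and the rule of pairing a sharp rise with the next sharp drop are assumptions chosen for illustration; the abstract does not specify how sharp changes are detected, merged, or filtered, and read mapping itself is not shown.

```python
from typing import List, Tuple

def read_depth(genome_len: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Per-position depth from half-open [start, end) alignment intervals."""
    diff = [0] * (genome_len + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering at start
        diff[end] -= 1     # and stops covering at end
    depth, running = [], 0
    for i in range(genome_len):
        running += diff[i]
        depth.append(running)
    return depth

def sharp_changes(depth: List[int], min_jump: int) -> List[Tuple[int, int]]:
    """Return (position, delta) wherever adjacent depths differ by at least min_jump."""
    return [(i, depth[i] - depth[i - 1])
            for i in range(1, len(depth))
            if abs(depth[i] - depth[i - 1]) >= min_jump]

def candidate_ranges(changes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Pair each sharp rise with the next sharp drop to form [start, end) ranges."""
    ranges, start = [], None
    for pos, delta in changes:
        if delta > 0 and start is None:
            start = pos
        elif delta < 0 and start is not None:
            ranges.append((start, pos))
            start = None
    return ranges

# Hypothetical mapping result on a 20 bp reference: two reads span the whole
# reference, and four extra reads pile up on positions 5 to 9.
alignments = [(0, 20), (0, 20)] + [(5, 10)] * 4
depth = read_depth(20, alignments)
ranges = candidate_ranges(sharp_changes(depth, min_jump=3))
```

Here the depth is 2 outside the pile-up and 6 inside it, so the only candidate range is the region bounded by the rise at position 5 and the drop at position 10. Reads mapped within such ranges would then be the repeat candidates.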

Keywords/Search Tags:

Repetitive element identification, Long-read sequencing, Network community detection, RepLong, RepPeak

Related items

1	Systematic Identification Of Intergenic Long-Noncoding RNAs In Mouse Retinas Using Full-Length Isoform Sequencing
2	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
3	Genomic Structural Variant Prediction Algorithm And Software
4	A Single Tube Long Fragment Read(stLFR) Sequencing Technology Based On Co-Barcode Method Research
5	Algorithms and Applications in Genome Assembly using Long Read Sequencing Technology
6	Analysis Of The Effect Of Repetitive DNA Sequence Characteristics On Sequencing Results
7	Data Analysis And Application Of Full Length LncRNA Based On Nanopore Long Read Sequencing Technology
8	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
9	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
10	The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets