
Research and Application of Third-Generation Genome Assembly for Highly Repetitive Sequences

Posted on: 2021-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: S Yang
Full Text: PDF
GTID: 2480306497966719
Subject: Software engineering
Abstract/Summary:
Third-generation genome assembly is a method of splicing the sequence fragments (reads) produced by sequencing technology into a complete, long genome sequence through steps of alignment and error correction. The genome is an important basis for bioinformatics analysis. In recent years, as third-generation sequencing technology has matured, researchers have constructed genomes for a large number of species, providing strong support for subsequent bioinformatics analysis. Highly repetitive and highly heterozygous genomes, however, remain an unsolved problem and pose a great challenge to third-generation genome assembly. New assembly algorithms are urgently needed to improve the accuracy and resource efficiency of assembling such difficult genomes. The main contents of this thesis are as follows.

(1) The efficiency and quality of two sequence alignment algorithms on highly repetitive genomes were analyzed experimentally. Sequence alignment is the most influential step in assembling genomes from third-generation sequencing data, and the accuracy of the alignment results directly affects the accuracy of the assembly. On highly repetitive genomes, the Daligner and Minimap2 sequence alignment algorithms agree on more than 80% of the aligned sequences. However, because reads from highly repetitive genomes overlap extensively, the alignments contain a large number of overlapping sequences. The roughly 20% of overlaps that are low quality reduce the accuracy of subsequent error correction and alignment. How to filter out these low-quality overlaps is a problem that subsequent research needs to solve.

(2) For highly repetitive genomes, a Minimizer-filter algorithm is proposed based on the minimizer. The minimizer is one of the reasons Minimap2 achieves highly efficient alignment. The Minimizer-filter algorithm first computes the minimizers of each sequence and stores them in a hash table. Fast sequence alignment is then achieved by comparing the minimizer information between sequences. To reduce the number of low-quality alignments among highly repetitive sequences, an equal-interval k-mer alignment algorithm is used to filter them out. Experiments show that highly repetitive genomes such as resurrection grass and Chinese cabbage can be assembled into high-quality genomes with this approach.

(3) Based on the Nextflow task scheduling framework, a visual Minimizer-filter assembly workflow was implemented, and several highly repetitive genomes were assembled and evaluated. Complete genome assembly includes genome alignment, error correction, sequence alignment after error correction, and assembly; the Minimizer-filter alignment algorithm is only one part of it. With the Nextflow task scheduling framework, the Minimizer-filter assembly workflow, its visualization, and other related optimizations can be built quickly. Finally, multiple highly repetitive genomes were assembled and evaluated using the Nextflow-based Minimizer-filter assembly workflow.

In summary, this thesis found that a minimizer-based sequence alignment algorithm can quickly obtain alignment results for highly repetitive genomes, with an alignment accuracy of more than 80%. On this basis, a fast genome alignment filtering algorithm, Minimizer-filter, was proposed to reduce the effect of low-quality alignments on sequence comparison. A complete visual assembly workflow was then developed around the Minimizer-filter algorithm, and assembly evaluations were performed on a variety of highly repetitive genomes.
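The abstract describes the minimizer step only in outline, so the following Python sketch is an illustration of the general technique rather than the thesis's implementation. It picks the lexicographically smallest k-mer in each window of w consecutive k-mers, stores the picks in a hash table, and counts shared minimizers between reads to find candidate overlaps. The function names, the parameters `k`, `w`, and `step`, the lexicographic ordering (Minimap2 orders k-mers by a hash instead), and the scoring rule in `spaced_kmer_support` are all assumptions made for the example; in particular, the abstract does not say how the equal-interval k-mer filter scores an alignment.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return the set of (kmer, position) minimizers of seq.

    Assumption: k-mers are ordered lexicographically; ties go to the
    leftmost k-mer. Sequences with fewer than w k-mers yield no minimizers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(max(len(kmers) - w + 1, 0)):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: (window[j], j))
        picked.add((window[best], start + best))
    return picked


def build_index(reads, k=5, w=4):
    """Hash table: minimizer k-mer -> list of (read name, position)."""
    index = defaultdict(list)
    for name, seq in reads.items():
        for kmer, pos in minimizers(seq, k, w):
            index[kmer].append((name, pos))
    return index


def candidate_pairs(index):
    """Count distinct minimizer k-mers shared by each pair of reads."""
    shared = defaultdict(int)
    for hits in index.values():
        names = sorted({name for name, _ in hits})
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                shared[(a, b)] += 1
    return dict(shared)


def spaced_kmer_support(query, target, k=5, step=3):
    """Hypothetical filter score: fraction of k-mers sampled from query
    at a fixed interval (step) that also occur anywhere in target."""
    target_kmers = {target[i:i + k] for i in range(len(target) - k + 1)}
    sampled = [query[i:i + k] for i in range(0, len(query) - k + 1, step)]
    if not sampled:
        return 0.0
    return sum(km in target_kmers for km in sampled) / len(sampled)
```

In a pipeline of this shape, pairs returned by `candidate_pairs` would be kept only if `spaced_kmer_support` exceeds some threshold; the threshold value is a tuning choice that the abstract does not report.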
Keywords/Search Tags: Third-generation genome assembly, Third-generation sequencing, Minimizer-filter, Highly repetitive genome