
Research On Genome Assembly Algorithm Based On Long-read Sequencing

Posted on: 2024-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2530306932480724
Subject: Computer Science and Technology
Abstract/Summary:
High-quality genome sequences play a vital role in downstream genomic analysis and provide strong support for genome annotation, variation detection, evolutionary analysis, gene function research, and comparative genomics. Genome assembly is a key step in obtaining the genome sequence of a species. In recent years, with the rapid development of high-throughput sequencing technology, genome assembly techniques have improved significantly, and obtaining high-quality genome sequences has become increasingly feasible. Despite the growing use of long, high-fidelity (HiFi) reads for genome assembly, assembly quality still needs to be improved on limited datasets; at the same time, repetitive sequences may introduce redundancy into the assembly. The continuity and accuracy of genome assembly therefore still need further improvement. The research goal of this thesis is to obtain high-quality genome sequences. The main research contents are as follows:

(1) A high-quality assembly analysis workflow based on limited HiFi read datasets and open-source tools is proposed. A preprocessing step that removes HiFi adapters is added to the general genome assembly workflow. At each step of the general workflow, different assembly-related algorithms are evaluated for completeness, accuracy, continuity, and other performance metrics on Arabidopsis thaliana and Oryza sativa datasets, and the best-performing current tool is selected for each step. This yields a high-quality assembly analysis workflow consisting of HiFiAdapterFilt (HiFi adapter removal) + Hifiasm (de novo assembly) + SLR (scaffolding) + TGS-GapCloser (gap closing). Experiments show that this workflow preserves the integrity and accuracy of the assembly as far as possible, and that on the human dataset it increases NG50 by 81.5% and 234.9% compared with using only Hifiasm and only HiCanu, respectively. More continuous sequences facilitate downstream analyses such as gene order analysis and functional genomics.

(2) A contig de-redundancy method based on HiFi reads, Ref_redundans, is proposed. For the first time, the method introduces high-quality genomes of the same species: it determines the possible source positions of contigs from existing conserved genes and their order, keeps the longest contig in each region, and then obtains the overall maximum collinear non-redundant set, which prevents contigs from different regions from being incorrectly removed because of repetitive sequences. It also provides optional modules that break misassembled contigs based on the high-quality genomes and further extend contigs and fill gaps to improve assembly quality. Three sets of experiments test the adaptability and correctness of Ref_redundans for redundancy removal on different de novo assemblies of Arabidopsis thaliana, its generalisability on human datasets, and its advantage over Redundans, a method designed for redundancy caused by heterozygous sequences. The results demonstrate that Ref_redundans removes redundant contigs effectively and correctly and generalises well.

(3) A web platform for redundancy removal in genome assembly based on Ref_redundans has been built. The platform provides simple removal of redundant contigs caused by repetitive sequences, offers optional scaffolding and gap-closing steps to improve assembly quality, and visualises its run-progress log together with a chart comparing the continuity and integrity of Ref_redundans and Redundans results.
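
The continuity gains in the abstract are reported as relative increases in NG50. As a minimal sketch of how such a figure could be computed, assuming the standard definition of NG50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the estimated genome size): the function names `ng50` and `percent_increase` and the toy lengths below are illustrative assumptions, not code or data from the thesis.

```python
def ng50(sequence_lengths, genome_size):
    """Return NG50 for a list of contig/scaffold lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches half of the (estimated) genome size; the length of the
    last sequence taken is the NG50. Returns 0 if the assembly covers less
    than half of the genome size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(sequence_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


def percent_increase(new_value, old_value):
    """Relative increase of new_value over old_value, in percent."""
    return (new_value - old_value) / old_value * 100.0


# Toy example (hypothetical lengths, not thesis data).
baseline = [30, 20, 20, 15, 15]   # more fragmented assembly
improved = [50, 30, 20]           # same bases joined into longer sequences
genome_size = 100

baseline_ng50 = ng50(baseline, genome_size)   # 30 + 20 reaches 50 -> 20
improved_ng50 = ng50(improved, genome_size)   # 50 reaches 50 -> 50
print(baseline_ng50, improved_ng50,
      percent_increase(improved_ng50, baseline_ng50))
```

Because NG50 is normalised by genome size rather than by total assembly length, it allows assemblies of different total length to be compared on the same scale, which is presumably why it suits a comparison between workflows.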
Keywords/Search Tags:long-read sequencing, genome assembly, assembly workflow, redundant contigs removal