| Long reads are gaining popularity among researchers because they excel at resolving genomic repeats. Research on long-read sequence assembly methods that analyze and mine the characteristics of long reads to achieve fast, accurate, and highly contiguous genome assembly is of great significance, because such methods support fundamental work in the life sciences, such as confirming disease diagnoses at the genetic level, discovering the risk of potential diseases, providing guidance for personalized medicine, and guiding reproductive health for the next generation. However, although reads from third-generation sequencing technology are long, their sequencing error rate is high; this, together with repeat regions in the genome and other factors, creates major obstacles for long-read-based genome assembly. Given the large data volume, long read length, and high error rate of long reads, this paper studies and improves two important steps in genome assembly. The first is overlap detection between the DNA sequences produced by sequencing; the second is the assembly of contigs into scaffolds. Based on the characteristics of long-read datasets, a statistical analysis of k-mer features, and a study of related algorithms, this paper proposes a long-read overlap detection algorithm based on the k-mer feature distribution, which selects reliable k-mers according to that distribution and then uses a two-stage strategy to determine the overlap interval. Based on an analysis of existing scaffolding algorithms and of contig and long-read datasets, this paper also proposes a scaffolding algorithm based on long reads and contig classification. A combination of classification methods divides contigs into unique contigs and fuzzy contigs. Building the scaffold graph from unique contigs only not only reduces the complexity of the scaffold graph but also improves assembly accuracy. Each of the two algorithms is compared with at least two currently popular methods of the same kind. The k-mer-based overlap detection algorithm is compared with MHAP and minimap2 on three metrics: accuracy, recall, and F1-score. The scaffolding algorithm based on long reads and contig classification is compared with the similar tools SSPACE-LongRead, LINKS, and npScarf. The effectiveness of the contig classification and of the repeat-aware framework is also analyzed and compared, with favorable results in all cases. The two proposed methods provide new ideas and solutions for research on sequence assembly. |
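
To make the idea of k-mer-based overlap detection concrete, the following is a minimal illustrative sketch, not the algorithm proposed in this paper. It assumes that a "reliable" k-mer is simply one whose count across all reads falls inside a frequency window, and that the two stages are (1) collecting read pairs that share reliable k-mers and (2) estimating the overlap interval from the positions of those shared k-mers. The function names, the parameters `min_count`, `max_count`, and `min_shared`, and the toy reads are all hypothetical; the sketch uses exact k-mer matches on one strand and ignores sequencing errors.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Yield (position, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def reliable_kmers(reads, k, min_count, max_count):
    """Assumed filter: keep k-mers whose total count lies in [min_count, max_count].

    Very rare k-mers are likely errors; very frequent ones are likely repeats.
    """
    counts = Counter(km for seq in reads.values() for _, km in kmers(seq, k))
    return {km for km, c in counts.items() if min_count <= c <= max_count}


def candidate_overlaps(reads, k=5, min_shared=3, min_count=2, max_count=10):
    """Return {(read_a, read_b): ((a_start, a_end), (b_start, b_end))}."""
    keep = reliable_kmers(reads, k, min_count, max_count)

    # Stage 1: index reliable k-mers and collect position pairs per read pair.
    index = defaultdict(list)
    for name, seq in reads.items():
        for pos, km in kmers(seq, k):
            if km in keep:
                index[km].append((name, pos))

    hits = defaultdict(list)
    for occurrences in index.values():
        for (a, pa), (b, pb) in combinations(occurrences, 2):
            if a == b:
                continue  # ignore k-mers repeated within a single read
            if a < b:
                hits[(a, b)].append((pa, pb))
            else:
                hits[(b, a)].append((pb, pa))

    # Stage 2: keep pairs with enough shared k-mers and estimate the interval
    # as the span covered by the shared k-mers on each read (end is exclusive).
    overlaps = {}
    for pair, positions in hits.items():
        if len(positions) >= min_shared:
            a_pos = [p for p, _ in positions]
            b_pos = [q for _, q in positions]
            overlaps[pair] = (
                (min(a_pos), max(a_pos) + k),
                (min(b_pos), max(b_pos) + k),
            )
    return overlaps


# Toy data: the last 8 bases of r1 equal the first 8 bases of r2; r3 is unrelated.
reads = {
    "r1": "ACGTTGCAAGCTTAGG",
    "r2": "AGCTTAGGCCATGACT",
    "r3": "GGATCCGATAACCGTA",
}
print(candidate_overlaps(reads, k=5, min_shared=3))
# {('r1', 'r2'): ((8, 16), (0, 8))}
```

In this toy run only the pair (r1, r2) shares enough reliable k-mers, and the reported intervals cover the suffix of r1 and the prefix of r2. Tools for real long reads must also handle both strands and error-induced mismatches, which this sketch deliberately leaves out.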