
Performance Analysis For The Assembly Of Repeat Space

Posted on: 2022-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Zhang
Full Text: PDF
GTID: 2480306608981069
Subject: Automation Technology
Abstract/Summary:
The development of genomics has revolutionized biological research, and the number of finished and ongoing genome projects is rapidly increasing. The assembly of new genomes relies mostly on computational algorithms; however, limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. The performance of these algorithms, together with the insert size of the sequencing libraries, read length, read accuracy, and genome complexity, determines the accuracy and continuity of a genome assembly, so analyzing the performance of the resulting assembled sequences is complicated. Because repetitive sequences are abundant in the genome and are more difficult to assemble, it is very important to analyze the performance of the repetitive sequence space when evaluating assembly results. Long repeats are often not fully assembled, or are even assembled only once. Performance analysis of long repeats is therefore particularly important when assessing the quality of a genome.

To evaluate the repetitive sequences in a newly assembled genome, we propose a method that analyzes the performance of the repeat space by aligning the short reads to the genome and calculating the depth of each base; the same approach applies to long repeats. After the depth has been made uniform, the depth of the bases in a repeat is close to the average depth if the repeat is fully assembled, and significantly higher than the average depth of the genome if it is not. That is, the closer the calculated depth of a repetitive sequence is to the average depth of the genome, the higher the assembly quality of that sequence. The depth of the entire genome is most uniform when the depth of the repetitive sequences is closest to the genome average. We therefore take the most uniform depth as the optimization goal when calculating the depth, and define this as the uniform depth calculation problem of the genome.

This paper first formally defines the Genome Depth of Uniform Computation Problem (GDUC) and the Depth of Uniform Computation Problem (DUC Problem). It proves that GDUC is an NP-complete problem using the Exact Block Covering Problem (EBC). Based on Jensen's inequality, we propose an optimization model for the GDUC problem, then simplify the problem and propose a local search algorithm to solve it. We implemented the algorithm as a command-line program called DCATools. Compared with BEDTools and SAMtools, the depth calculation model of DCATools is effective for uniform depth calculation. For the performance analysis index DCA, we use eight sets of genomic data to calculate the DCA value and the LAI value and compare the results, which show that the DCA value can effectively evaluate the assembly quality of repetitive sequences.

This paper also provides a method for compressing the length of the sequence depth storage array, which alleviates the problem of insufficient memory for depth calculation when the genome is large. The method uses one array element to store the depth of multiple adjacent bases; the stored value is the sum of the depths of these bases. If the depth values of c bases are compressed into one element, the array covering the DNA sequence is reduced to 1/c of its original length, and the average over the c bases represents the depth of each of them. We simulate 20X, 50X, and 100X Illumina sequencing reads from the reference genomes of Arabidopsis and Medicago truncatula, run experiments on these data with compression scales of 1, 2, 4, 6, and 8, and compare the compression loss. When the compression ratio is 2, the accuracy of the depth calculation is above 80%, which indicates that the compression loss of this method is acceptable.
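The depth-array compression described in the abstract can be illustrated with a minimal Python sketch. This is not the DCATools implementation: the function names `accumulate_compressed` and `expand_to_per_base`, the half-open `(start, end)` alignment intervals, and the toy data are all assumptions made for illustration. It only shows the stated idea that one array element stores the sum of the depths of c adjacent bases and that each base is later represented by the block average.

```python
from typing import List, Tuple


def accumulate_compressed(genome_length: int,
                          alignments: List[Tuple[int, int]],
                          c: int) -> List[int]:
    """Accumulate read depth into blocks of c adjacent bases.

    Each alignment is a hypothetical half-open interval [start, end) on the
    genome. Element k of the result holds the SUM of the depths of bases
    k*c .. k*c + c - 1, so the array is about 1/c as long as the genome.
    With c = 1 this is the ordinary per-base depth array.
    """
    n_blocks = (genome_length + c - 1) // c  # ceiling division
    sums = [0] * n_blocks
    for start, end in alignments:
        for pos in range(start, end):
            sums[pos // c] += 1
    return sums


def expand_to_per_base(sums: List[int], c: int,
                       genome_length: int) -> List[float]:
    """Approximate every base's depth by the average of its block."""
    depth: List[float] = []
    for k, total in enumerate(sums):
        # The last block may cover fewer than c bases.
        block_len = min(c, genome_length - k * c)
        depth.extend([total / block_len] * block_len)
    return depth


if __name__ == "__main__":
    # Toy example: an 8-base "genome" and three aligned reads.
    reads = [(0, 4), (2, 6), (2, 8)]
    exact = accumulate_compressed(8, reads, c=1)
    packed = accumulate_compressed(8, reads, c=4)
    print(exact)
    print(packed)
    print(expand_to_per_base(packed, 4, 8))
```

With c = 1 the sketch reduces to an uncompressed per-base depth array, which corresponds to the compression scale of 1 used as the baseline in the experiments. Larger c trades per-base resolution for memory, since depth differences inside a block are averaged away.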
Keywords/Search Tags:Depth Calculation, Repeat Space, Long Repeats, Performance Analysis, Length Compression of Sequence Depth Storage Array