
Research And Optimization On Sequence Alignment Algorithm For Third-Generation Sequencing

Posted on: 2020-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Song
Full Text: PDF
GTID: 2370330575464571
Subject: Computer software and theory
Abstract/Summary:
In recent years, the development of third-generation sequencing technology has brought great changes to genomics. However, because third-generation sequencing reads are long on average and have a high error rate, existing sequence alignment algorithms for third-generation sequencing consume a large share of the time in the data analysis workflow. How to align large-scale sequencing reads to reference genomes quickly and accurately is therefore a major challenge for third-generation sequence alignment. At present, mainstream algorithms mostly use the seed-and-extend method, which consists of filtering candidate positions and then verifying them. Filtration and validation are the key factors affecting the performance of the algorithm. To accelerate sequence alignment, this thesis studies in depth the feature selection of the filtering method and the indexing technology used in the validation phase. The main work and contributions are as follows:

(1) Design and optimization of the filtering method

The existing filtering methods are analyzed. They use all seeds to filter candidate positions, so the number of seeds to be processed is large and the seeds are poorly targeted, which makes the filtering stage inefficient. Our experiments show that low-frequency seeds discriminate better between candidate positions during filtering and also effectively reduce the amount of computation. Based on this, a filtering method based on low-frequency seeds is proposed in this thesis. The low-frequency seeds are selected dynamically according to the size of the genome, and they vote to locate candidate regions. The number of candidate regions produced by filtering is also an important criterion for a filtering method. To further reduce the number of candidate regions, we optimize the filtering method and propose three heuristic strategies: merging adjacent windows, judging and validating candidate windows, and transforming the seed region. These strategies re-filter candidate positions while preserving sensitivity. The experimental results show that when the frequency range of seeds is set to 20%, the proposed filtering method greatly reduces the time spent in the filtering stage and is about 10 times faster than existing filtering methods. At the same time, the optimized filtering method removes about 70% of candidate positions.

(2) Index modification and improvement of the verification method

In the validation phase, candidate regions need to be extended to obtain the final alignment result. Existing methods usually use an index to construct the optimal coverage chain and so reduce the scope of comparison. However, with a global index, the long length of third-generation sequencing reads produces a large number of misplaced and invalid anchors that must be handled in the chaining stage. To solve this problem, we design a segmented hash index and propose a verification method based on it. The index and the candidate regions are segmented separately. Restricting anchors by their positional relationship reduces the number of invalid anchors and accelerates chaining, which reduces the time spent in the verification phase. The experimental results show that the improved method reduces validation time by more than 30%.

Combining the proposed filtering method with the validation method, whole-process sequence alignment experiments are carried out on the Arabidopsis genome and the human genome. Compared with existing third-generation sequencing sequence alignment algorithms, the overall speed is increased by 2-5 times.
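The low-frequency seed voting described in contribution (1) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the fixed k-mer seeds, the fixed-size voting windows, and the reading of "20%" as "keep roughly the 20% least frequent distinct k-mers" are all assumptions made for illustration. The thesis selects the threshold dynamically from the genome size, and this sketch does not reproduce that rule.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map each k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def low_frequency_cutoff(index, fraction):
    """Occurrence count of the k-mer at the `fraction` quantile (assumed rule).

    Seeds occurring at most this many times are treated as low-frequency.
    Ties at the cutoff are all kept, so slightly more than `fraction`
    of the distinct k-mers may pass.
    """
    counts = sorted(len(positions) for positions in index.values())
    keep = max(1, int(len(counts) * fraction))
    return counts[keep - 1]


def candidate_windows(read, index, k, cutoff, window, min_votes):
    """Vote with low-frequency seeds; return (window_start, votes) pairs."""
    votes = Counter()
    for j in range(len(read) - k + 1):
        positions = index.get(read[j:j + k])
        # Skip seeds absent from the reference or too frequent to be informative.
        if not positions or len(positions) > cutoff:
            continue
        for p in positions:
            # Each hit votes for the window holding the implied read start.
            implied_start = max(p - j, 0)
            votes[implied_start // window] += 1
    hits = [(w * window, n) for w, n in votes.items() if n >= min_votes]
    # Best-supported windows first; ties broken by position.
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

In this sketch, a read drawn from a unique part of the reference concentrates its votes in one window, while a read made only of repetitive k-mers casts no votes at all because every one of its seeds exceeds the cutoff. That is the intuition behind skipping high-frequency seeds: they cost the most lookups and say the least about where the read belongs.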
Keywords/Search Tags: third-generation sequencing, sequence alignment, seed-and-extend methods, low-frequency seeds, index