Font Size: a A A

The Study On Read Alignment Algorithm For High-throughput Sequencing Datasets

Posted on:2020-02-17Degree:MasterType:Thesis
Country:ChinaCandidate:J P SunFull Text:PDF
GTID:2370330602952530Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of high-throughput sequencing technology and the decline in sequencing costs,the scale of short reads is getting larger and larger.Quickly and accurately aligning a large number of short read sequences onto a reference genomic sequence remains a challenging problem.Aiming at that the short read sequence alignment algorithm based on hash index runs with large memory and low precision,this paper proposes a short read sequence alignment algorithm SSFA.We design a lightweight hash index structure which takes up less memory space than the traditional hash index-based alignment algorithm.We also design the seed grouping algorithm and filtering algorithm to improve the accuracy and speed of the alignment.Filtering algorithm include seed selection algorithm and candidate location filtering algorithm.The specific work is as follows:The structural design of the lightweight hash index.The traditional hash index-based short read sequence alignment algorithm needs to store the fixed length subsequence generated by each position of the reference genome sequence and the current position information into the hash index when constructing the index,resulting the hash index occupy more memory space.In order to solve this problem,the SSFA establishes a lightweight hash index structure,and stores the fixed length subsequence corresponding to the current position and the current position information on the reference genome sequence every fixed length step.In the constructed lightweight hash index structure,the key length table is used to store the fixed length subsequence,and the location table stores the position information corresponding to the subsequence.Seed group and filter algorithms.In order to deal with the problem that the lightweight hash index is missing part of the reference genome sequence position information,in the process of short read alignment,we propose a seed grouping and filtering algorithm.First,all seeds generated by the short read sequence are divided into step seed sets according to their positions in the short read sequence by using the seed grouping algorithm.Then,each seed set is filtered by a filtering algorithm to obtain a candidate position corresponding to the short read sequence.In the filtering process,we design a seed selection algorithm which based on dynamic programming.From each seed set,the algorithm select d + a + 1(d is Hamming distance,a is the number of additional non-overlapping seeds)optimal nonoverlapping seeds.Ensure that this d + a + 1 non-overlapping seed combination is the smallest combination of the frequency of occurrences on the reference genome sequence.We designed a candidate location filtering algorithm to filter out the d + a + 1 nonoverlapping seed's redundant locations and use the remaining locations as candidate locations corresponding to the seed set.Finally,the candidate positions corresponding to all the seed sets are merged,and all the merged candidate positions are compared using the Smith–Waterman algorithm,and the most appropriate position is selected as the result output.The SSFA recall rate,accuracy,sensitivity,runtime,and memory footprint were tested on real datasets and simulation datasets downloaded on NCBI.The experimental results show that SSFA has certain advantages in comparison with mainstream algorithms in accuracy and sensitivity.It is comparable to mainstream algorithms in recall rate and running time.The memory consumption is reduced by 15%~24% compared to other hash index-based algorithms.
Keywords/Search Tags:short read alignment, hash index, filter, seed
PDF Full Text Request
Related items