Basecalling, Polishing and SNP Detection Algorithms for Nanopore Sequencing

Posted on: 2023-08-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Huang
Full Text: PDF
GTID: 1520307310463694
Subject: Computer application technology
Abstract/Summary:
With the advantages of high throughput, long read length, no need for amplification, and direct RNA sequencing, Nanopore long-read sequencing, a single-molecule sequencing technology, is widely used in genome assembly, structural variation detection, base modification detection, and full-length transcript detection, as well as in more advanced fields such as prenatal testing, clinical diagnosis, and epidemic prevention and control. However, the high error rate and uneven error distribution of Nanopore sequencing remain an important obstacle to its development, and they call into question the results of downstream analyses based on Nanopore reads. Starting from this high error rate, this dissertation studies, from the perspectives of the causes and the distribution of errors, how to improve the accuracy of basecalling, genome assembly, and SNP detection with erroneous Nanopore reads. The main work and innovations of the study are as follows.

(1) A self-attention-based basecalling algorithm for Nanopore sequencing, named SACall, is proposed. Because the sequencing speed is unstable, each nucleotide in the DNA is measured by a varying number of raw current signals. In addition, because the protein pore has a certain length, each current signal is jointly influenced by several bases. First, to handle the uneven number of signals per base, SACall applies a convolutional neural network to extract local features directly from the raw current signals, which avoids errors introduced by signal segmentation and reduces the impact of the uneven signal count on basecalling. Second, because multiple nucleotides jointly determine each raw current signal and each signal is strongly correlated with the signals that precede and follow it, SACall applies the Transformer self-attention model to compute the similarity between any two positions in the raw signal and so capture the context of each signal. SACall was compared with several other popular basecalling algorithms on real test datasets and outperformed them in read accuracy, assembly quality, and consensus accuracy.

(2) An assembly polishing algorithm, NeuralPolish, based on an alignment matrix and an orthogonal bidirectional GRU network, is proposed. The currently popular neural network-based polishing algorithms for Nanopore sequencing take the base frequencies at each position of the contig as features and use recurrent neural networks to learn the correlation between base frequencies at different positions, in order to predict the true base at each contig position. However, this feature extraction compresses the sequence information from multiple reads into one-dimensional frequency features and loses the sequence information of each individual read. NeuralPolish instead encodes the read-to-assembly alignment as an alignment matrix, from which the base frequency features are easily obtained while the sequence information of each read is retained. NeuralPolish applies an orthogonal bidirectional GRU network to compute, from the alignment matrix, the contextual sequence information of each read and the base frequency feature of each contig position, respectively. On several polishing test datasets, NeuralPolish achieves higher polishing accuracy than other popular Nanopore polishing algorithms.

(3) A block divide-and-conquer assembly polishing algorithm, named BlockPolish, is proposed. Different regions of a genome assembly have different error rates, and low-error and high-error regions differ in how difficult they are to polish. Different polishing strategies or parameter thresholds should therefore be used to guarantee polishing accuracy, yet existing Nanopore polishing algorithms treat all regions of the assembly equally. BlockPolish analyzes the error types and error rate distribution of different regions based on the read-to-assembly alignment and divides the contig into trivial and complex blocks. To account for the different error characteristics of the two block types, BlockPolish trains two recurrent neural networks with the same structure but different parameters: one on data from trivial blocks and the other on data from complex blocks. The two models then predict the polished contig for trivial and complex blocks, respectively. On a real human test dataset, BlockPolish polishes assemblies more accurately than other popular Nanopore polishing algorithms.

(4) A two-stage haplotype-aware SNP detection algorithm for low-coverage Nanopore sequencing, named NanoSNP, is proposed. Existing SNP detection methods designed for Nanopore sequencing extract either short-range base frequency features or long-range haplotype features from the regions adjacent to candidate SNP sites in the read-to-reference alignment in order to detect true SNP sites. NanoSNP adopts a two-stage SNP detection strategy. First, a recurrent neural network makes an initial prediction of SNP sites from the short-range base frequency features around the candidate sites. Then, phasing information is added to each read by WhatsHap. Finally, the SNP sites are re-validated by a model that couples a convolutional neural network with a recurrent neural network and combines the local base frequency features with the long-range haplotype features of the candidate sites. On low-coverage Nanopore sequencing datasets, the SNPs identified by NanoSNP have the highest F1-score among the SNPs called by popular Nanopore SNP detection algorithms.
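
The contrast drawn in (2) between per-position base frequencies and an alignment matrix can be illustrated with a minimal Python sketch. It is not the encoding used in the dissertation: the function names, the gapped-string input, and the `BASES` alphabet are assumptions made for illustration, and insertions relative to the contig are ignored. It only shows that the frequency features can be derived from the matrix, while the matrix cannot be recovered from the frequencies.

```python
from collections import Counter

# Hypothetical alphabet: four bases plus '-' for a gap in the read
# relative to the contig (an assumption, not the author's encoding).
BASES = "ACGT-"


def alignment_matrix(aligned_reads):
    """Stack gapped, equal-length read strings into a reads x positions matrix."""
    if not aligned_reads:
        raise ValueError("at least one aligned read is required")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    return [list(read) for read in aligned_reads]


def base_frequencies(matrix):
    """Collapse the matrix column by column into per-position base frequencies."""
    freqs = []
    for column in zip(*matrix):
        counts = Counter(column)
        total = len(column)
        freqs.append({base: counts.get(base, 0) / total for base in BASES})
    return freqs


# Four toy reads aligned to a four-base contig window.
reads = ["ACGT", "ACGT", "A-GT", "ACTT"]
matrix = alignment_matrix(reads)
freqs = base_frequencies(matrix)

# Two different read sets that collapse to identical frequency features:
# the per-read pairing of bases (A with C versus A with T) is lost.
matrix_a = alignment_matrix(["AC", "GT"])
matrix_b = alignment_matrix(["AT", "GC"])
same_frequencies = base_frequencies(matrix_a) == base_frequencies(matrix_b)
same_matrices = matrix_a == matrix_b
```

In this toy case `same_frequencies` is true while `same_matrices` is false, which is the information loss that the alignment-matrix representation is described as avoiding.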
Keywords/Search Tags: Nanopore Sequencing Technology, Basecalling, Assembly Polishing, Single Nucleotide Polymorphism