Deep Learning Based Algorithm For Oxford Nanopore Basecalling

Posted on: 2022-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zeng
Full Text: PDF
GTID: 2480306569480794
Subject: Computer technology
Abstract/Summary:
Nanopore sequencing technology has attracted widespread attention in research fields such as whole-genome sequencing and genome assembly because of its ultra-long read length. Basecalling is an important step of nanopore sequencing: a basecalling algorithm analyzes the sequencing signals to identify the base sequence of the DNA or RNA. At present, the error rate of basecalling is still high (about 10%-15%), owing to sequencing noise, homopolymers and base modifications. In addition, most basecalling algorithms use deep learning models based on recurrent neural networks (RNNs) for base identification, and their recurrent structure makes identification slow. In this thesis, we propose a modular basecalling algorithm to further improve the accuracy and speed of basecalling. In the proposed algorithm, a deep learning model based on a temporal convolutional network (TCN) acts as the core for base identification. The main work is as follows.

First, we clarify the research background, significance and current state of nanopore basecalling, describe several representative basecalling software packages, and analyze the characteristics of the basecalling algorithm used by each.

Then, we propose a modular basecalling algorithm aimed at two problems of RNN-based algorithms: slow speed and interference from redundant sequencing signals. The core module of the algorithm is the deep learning model for base identification. We designed an end-to-end model called CausalCall, which consists mainly of a modified TCN and a connectionist temporal classification (CTC) decoder. CausalCall uses dilated causal convolutions activated by gated linear units to model the characteristics of the sequencing signals. Because convolutions, unlike recurrent layers, can be computed in parallel across time steps, this speeds up base identification. By controlling the receptive field of the convolution layers, the model ensures that only input data within the effective range is used for each decision, thereby reducing the interference of redundant information. Based on the proposed algorithm, we designed a simple and easy-to-use basecalling software package.

Finally, we used nanopore sequencing data from multiple species to evaluate the proposed algorithm. Compared with other basecalling software, the proposed algorithm is superior in basecalling accuracy and speed, as well as in reference-based genome assembly.

In conclusion, the main achievements of this thesis can be summarized as follows:
(1) We comprehensively analyzed the characteristics of nanopore sequencing data and of existing basecalling algorithms, and then proposed a modular basecalling algorithm that uses a deep learning model for base identification. Taking advantage of the TCN's strong ability to model time-series data, its controllable convolutional receptive field and its fast computation, we designed the CausalCall model with a CTC decoder, which effectively improves the accuracy and speed of basecalling.
(2) Based on the proposed algorithm, we designed a simple and easy-to-use basecalling software package.
(3) In addition to theoretical analysis, we evaluated the proposed algorithm on nanopore sequencing data from multiple species and further completed a genome assembly experiment using Klebsiella sequences. The results show that the proposed algorithm achieves high basecalling accuracy and speed. Sequences produced by the proposed algorithm can be assembled into a high-quality genome, which indicates the practical value of the algorithm for nanopore-based genome research.
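The building blocks named in the abstract (dilated causal convolution, gated linear unit activation, a bounded receptive field, and CTC decoding) can be illustrated with a minimal pure-Python sketch. This is not the CausalCall implementation: the function names, the single-channel signal, the kernel size and dilation values, and the use of greedy (best-path) CTC decoding are all assumptions made for illustration, and the abstract does not state which decoding strategy or layer configuration the thesis uses.

```python
import math


def receptive_field(kernel_size, dilations):
    """Input samples visible to one output of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution on a single channel.

    output[t] depends only on x[t], x[t - d], x[t - 2d], ... (never on future
    samples). weights[0] multiplies the current sample; positions before the
    start of the signal are treated as zero.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def glu(values, gates):
    """Gated linear unit: values * sigmoid(gates), element-wise."""
    return [v / (1.0 + math.exp(-g)) for v, g in zip(values, gates)]


def gated_causal_layer(x, w_value, w_gate, dilation):
    """One hypothetical layer: two causal convolutions combined by a GLU."""
    return glu(causal_dilated_conv(x, w_value, dilation),
               causal_dilated_conv(x, w_gate, dilation))


def ctc_greedy_decode(frame_labels, blank="-"):
    """Best-path CTC decoding: merge repeated labels, then drop blanks."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return "".join(decoded)


if __name__ == "__main__":
    # Assumed configuration: kernel size 3, dilations doubling per layer.
    print(receptive_field(3, [1, 2, 4, 8]))
    print(causal_dilated_conv([1, 2, 3, 4], [1, 1], dilation=2))
    print(ctc_greedy_decode("AA-A-CC--G"))
```

The receptive-field function shows how the range of input that influences a decision is fixed by the kernel size and dilations, which is the sense in which the range is "controllable"; an RNN, by contrast, has no such fixed bound on how far back its state reaches.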
Keywords/Search Tags:Nanopore Sequencing, Basecalling, Temporal Convolutional Network, Connectionist Temporal Classification