Research On Base Electrical Signal Calling Algorithm Of Nanopore DNA Sequencing Based On Deep Learning

Posted on: 2023-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Meng
Full Text: PDF
GTID: 2530307061454704
Subject: Biomedical engineering
Abstract/Summary:
DNA is the carrier of genetic information, and DNA sequencing technology plays a vital role in medicine and life science research. Sequencing has progressed from first-generation technology, represented by dideoxy chain termination, through second-generation technology, represented by sequencing by synthesis, to third-generation technology, represented by single-molecule fluorescence sequencing and nanopore sequencing. Among these, nanopore sequencing has attracted attention for its long reads, fast read speed, and single-molecule detection. However, whereas second-generation sequencing has an error rate below one in a thousand, the error rate of nanopore sequencing remains high, at one in twenty to one in ten, which limits its wider application.

In nanopore sequencing, a DNA molecule passes through a nanopore, and the corresponding base sequence is decoded by analysing the change in ionic current during translocation. This process is called basecalling, and the tool that performs it is called a basecaller. Basecallers have evolved from hidden Markov models through event segmentation models to deep learning models. Oxford Nanopore Technologies, a leader in nanopore sequencing, has launched a variety of sequencers equipped with a basecaller called "Guppy". Because Guppy is not open source, it is difficult for other researchers to improve on it. This thesis therefore constructs a basecalling algorithm based on a deep neural network and uses it to design and implement an open-source basecaller with accuracy on par with Guppy. It further studies the correlation between the training data and the accuracy of the algorithm, carries out targeted optimization, and successfully extends the model to the identification of methylated bases. The main work is as follows:

(1) A basecalling algorithm based on a deep neural network is designed, and an open-source basecaller is developed from it. The algorithm uses an encoder-decoder architecture. The encoder is a one-dimensional CNN (Convolutional Neural Network) with separable convolution layers and residual connections, which both extracts the features of the sequencing current effectively and greatly reduces the number of model parameters. The decoder combines a two-layer Bi-LSTM with a CTC decoding layer, making full use of the Bi-LSTM's ability to capture long-range dependencies in both directions when processing time series. The model thus combines the detection speed of a CNN with the long-range sequence modelling of an RNN (Recurrent Neural Network). The basecaller achieves a median accuracy of 97.854% on the human test set, and its accuracy on Mus musculus, Arabidopsis, zebrafish and Klebsiella pneumoniae is also on par with that of Guppy.

(2) The influence of training data on the accuracy of the algorithm is studied, and two methods of optimizing the basecalling algorithm from the perspective of training data are proposed. First, the model trained on the human training set is tested on five species, including human, and the results are compared with the frequency distribution of 5-mers in the genomes of the five species. The similarity between the genomic 5-mer frequency distributions of the training set and the test set is found to determine basecalling performance on the test set: data sets whose genomic 5-mer distribution resembles that of the training set are basecalled more accurately. It is further shown that supplementing the training set with the 5-mer types it lacks helps improve basecaller accuracy. Second, to address the high basecalling error rate on poly-base sequences, it is shown that the error rate in poly-base regions can be reduced by constructing a data set with a high proportion of poly-base sequences. This work offers guidance on how to construct and optimize training sets for deep learning basecallers, suggesting that a training set with a balanced 5-mer distribution and a moderately increased frequency of poly-base sequences has the potential to train a widely applicable, highly accurate model.

(3) Based on the deep neural network from part (1), a basecalling algorithm is constructed that recognizes all bases, including the methylated base 5mC. First, a five-base data set containing 5mC information is built from bisulfite sequencing results and the reference genome; the encoder-decoder network from part (1) is then applied to this data set. The results suggest that the trained model detects 5mC with accuracy similar to that of DeepSignal, a dedicated 5mC recognition tool, and surpasses Nanopolish, the best-performing existing all-base recognition tool, giving it important application value in 5mC basecalling.

(4) An automatic analysis program, EasyNanopore, is built for detecting nanopore translocation events. Some nanopore molecular detection experiments require statistical analysis of the characteristics of translocation events. To meet this need, EasyNanopore uses multiple processes to accelerate detection, provides a user-friendly graphical interface, and requires no runtime environment to be configured in advance. The results show that EasyNanopore detects and analyses translocation events efficiently and automatically, with results highly consistent with manual analysis.
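
The 5-mer comparison described in part (2) can be illustrated with a minimal sketch. It is not the thesis's code: the abstract does not say which similarity measure was used, so cosine similarity is an assumption here, and the function names `kmer_distribution`, `cosine_similarity` and `missing_kmers` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"
K = 5  # 5-mers, as in the abstract


def kmer_distribution(sequence, k=K):
    """Return the relative frequency of every possible k-mer in `sequence`."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k k-mers so absent ones appear with frequency 0.
    # Windows containing other symbols (e.g. N) are ignored.
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def cosine_similarity(p, q):
    """Cosine similarity of two k-mer distributions (an assumed metric)."""
    dot = sum(p[kmer] * q[kmer] for kmer in p)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def missing_kmers(train_dist, test_dist):
    """k-mers that occur in the test genome but never in the training genome."""
    return sorted(
        kmer for kmer, freq in test_dist.items()
        if freq > 0 and train_dist[kmer] == 0
    )


if __name__ == "__main__":
    # Toy sequences standing in for training and test genomes.
    train = kmer_distribution("AAAAAA")
    test = kmer_distribution("AAAAACCCCC")
    print(cosine_similarity(train, test))
    print(missing_kmers(train, test))
```

On real genomes the same functions would be run over each reference sequence; the k-mers returned by `missing_kmers` are candidates for the kind of training-set supplementation the abstract describes.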
Keywords/Search Tags:Deep learning, Nanopore, Basecalling, Sequencing technology