Nanopore Basecalling Based On Conditional Random Field

Posted on: 2023-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: L A Du
GTID: 2530306848458074
Subject: Computer technology
Abstract/Summary:
Research and development in gene sequencing is of great significance to the life sciences. The nanopore sequencing platform offers real-time sequencing, portability, long read lengths, and low cost, but its high sequencing error rate hinders its application in many research areas, such as disease research (especially the detection of low-frequency mutations in tumors) and liquid biopsy. Nanopore sequencing platforms use deep learning methods for basecalling. The purpose of this paper is to build a fast and accurate basecalling model that identifies base sequences directly from the current signal. To this end, the idea of the Conditional Random Field (CRF) is introduced into the basecalling of nanopore sequencing signals, and a CRF model, loss function, and decoding function for nanopore sequencing are implemented. In addition, using a Fully Convolutional Neural Network (FCNN) for signal feature extraction significantly improves basecalling speed. This paper mainly completes the following work:

(1) The idea of CRF is introduced into the basecalling problem to address the conditional independence assumption made by Connectionist Temporal Classification (CTC). Using a long short-term memory (LSTM) network structure, the accuracy of the CRF and CTC methods is compared, confirming that CRF is better suited to basecalling than CTC. CRF reaches 95.87% mean and 96.46% median accuracy, higher than CTC's 93.91% mean and 96.28% median accuracy.

(2) A U-net structured Fully Convolutional Neural Network is implemented to extract features from nanopore signals, with batch normalization and the GELU activation function used in the model to obtain better basecalling results. A comparison with the CNN + LSTM structure confirms that the U-net + LSTM structure performs the basecalling task well. Although the accuracy of the U-net-only model in this paper is slightly lower than the 96.74% mean and 97.33% median of the LSTM model on the test data set, it is much faster than the LSTM structure, and its training time is only half that of the LSTM.

(3) The model is tested on actual sequencing data. Experiments on actual sequencing data from the Coli_S10, Lambda, and KP_NUH29 species show that the U-net structured CRF model achieves good basecalling results. The sequence-level accuracy of the model in this paper is slightly lower than that of the LSTM model, reaching about 94%. The experiments also show that using the U-net network structure alone does not significantly reduce accuracy, while basecalling speed is more than twice that of the LSTM network.
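The difference between per-frame decoding under a conditional independence assumption and CRF decoding can be illustrated with a small sketch. The code below is not the decoder from this thesis, whose state space and scoring are not described in the abstract. It is a minimal linear-chain Viterbi decoder in plain Python, with hypothetical function names (`greedy_decode`, `viterbi_decode`) and made-up emission and transition scores over two labels. It shows how transition scores between adjacent labels can change the decoded path compared with choosing the best label at every frame independently.

```python
# Illustrative sketch only: toy scores, two labels, three frames.
# All names and numbers here are assumptions, not values from the thesis.

def greedy_decode(emissions):
    """Pick the highest-scoring label at each frame independently."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]


def viterbi_decode(emissions, transitions):
    """Return the label path maximizing the sum of emission and transition scores.

    emissions[t][j]   : score of label j at frame t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best path score ending in each label so far
    backptrs = []               # backptrs[t][j] = best previous label for j
    for frame in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + frame[j])
            ptrs.append(best_i)
        score = new_score
        backptrs.append(ptrs)

    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=score.__getitem__)
    path = [last]
    for ptrs in reversed(backptrs):
        last = ptrs[last]
        path.append(last)
    path.reverse()
    return path


# Toy input: the middle frame weakly prefers label 1.
emissions = [[2.0, 0.0], [0.0, 0.5], [2.0, 0.0]]
# Hypothetical transition scores that penalize switching labels.
transitions = [[0.0, -2.0], [-2.0, 0.0]]

print(greedy_decode(emissions))                # [0, 1, 0]
print(viterbi_decode(emissions, transitions))  # [0, 0, 0]
```

With these toy scores, the independent per-frame choice follows the weak preference in the middle frame, while the Viterbi path stays on label 0 because the switching penalty outweighs it. With all transition scores set to zero, the two decoders agree.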
Keywords/Search Tags:Nanopore sequencing, Basecalling, Convolutional Neural Network, Conditional Random Fields