Study Of Nanopore Sequencing Data Analysis Methods

Posted on: 2021-06-11 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Q Zhang | Full Text: PDF
GTID: 2480306047499474 | Subject: Control Science and Engineering
Abstract/Summary:
DNA carries the genetic instructions of humans and all other living organisms. Since DNA was established as the genetic material, a long series of genome sequencing efforts has followed. Gene sequencing has applications in many fields, including precision medicine, crop improvement, identification, and cancer treatment, so research on sequencing technology is of great significance. The technology has progressed from the first-generation dideoxy chain-termination method of the 1970s, to the more accurate, high-throughput second generation, to the third generation of single-molecule sequencing represented by nanopore sequencing, and it has steadily matured along the way. Nanopore sequencing, the most recent of these, identifies base sequences from electrical current signals and offers low cost, high speed, portability, real-time output, and long reads. Because the method is new, however, its accuracy still needs to be improved.

Drawing on speech recognition, this thesis introduces an end-to-end basecalling method that requires no segmentation of the current signal. First, sequencing data from λ phage, generated by the MinION nanopore sequencer, are preprocessed through quality control, length filtering, and error correction by alignment to the reference. The data set needed for model training is then constructed and divided into training, validation, and test sets, which are used for model training and for evaluating model performance. The basecalling model combines a convolutional neural network (CNN) and a long short-term memory network (LSTM) as the basic forward-propagation structure, with connectionist temporal classification (CTC) as the loss and decoding mechanism. During training, different numbers of network layers, numbers of neurons, and convolution kernel sizes are compared in order to select suitable hyperparameters. On this basis, an attention mechanism and batch normalization are introduced to further optimize the model, and its performance before and after optimization is compared and analyzed. Finally, the thesis introduces ensembling: the basic model and the attention model are combined through a weight parameter, and the performance of the ensemble is compared under different weight values. The experimental results show that the model constructed in this thesis performs well at basecalling nanopore sequencing data.
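The decoding and ensembling steps described above can be illustrated with a minimal sketch. The abstract does not say how the weight parameter is applied, so the code below assumes a convex combination of the two models' per-frame class probabilities, followed by greedy (best-path) CTC decoding. The function names, the alphabet ordering, and the toy probability matrices are all hypothetical and are not taken from the thesis.

```python
# Assumption: class 0 is the CTC blank, followed by the four bases.
ALPHABET = "-ACGT"
BLANK = 0


def ensemble_probs(p_base, p_attn, w):
    """Blend two models' per-frame probabilities as w*p_base + (1-w)*p_attn.

    This convex combination is an assumed form of the "weight parameter"
    mentioned in the abstract, not the thesis's confirmed formula.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if len(p_base) != len(p_attn):
        raise ValueError("both models must emit the same number of frames")
    return [
        [w * a + (1.0 - w) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(p_base, p_attn)
    ]


def ctc_greedy_decode(probs):
    """Best-path CTC decoding: take the argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    bases = []
    prev = None
    for k in best:
        # Emit a base only when the label changes and is not the blank.
        if k != prev and k != BLANK:
            bases.append(ALPHABET[k])
        prev = k
    return "".join(bases)


# Toy outputs for four signal frames (rows) over the classes -, A, C, G, T.
P_BASE = [
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A again (merged with the previous frame)
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # C
]
P_ATTN = [
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A
    [0.10, 0.10, 0.10, 0.60, 0.10],  # G (disagrees with the base model)
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # C
]

if __name__ == "__main__":
    for w in (0.0, 0.3, 0.7, 1.0):
        print(w, ctc_greedy_decode(ensemble_probs(P_BASE, P_ATTN, w)))
```

In this toy case the two models disagree on the second frame, so the decoded read changes with the weight. Comparing decoded reads against a reference across several weight values is one plausible way to carry out the comparison the abstract describes.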
Keywords/Search Tags:Nanopore sequencing, Basecalling, Neural network, Batch normalization