
Variation Analysis Of Single-Molecule Real-time Sequencing Data Based On Deep Learning

Posted on: 2022-04-23; Degree: Master; Type: Thesis
Country: China; Candidate: Y X Lv; Full Text: PDF
GTID: 2480306602456774; Subject: Computer technology
Abstract/Summary:
With the continuing advance of biotechnology, the study of disease at the level of human genetics has become a research focus. Single-molecule real-time (SMRT) sequencing has emerged as a new gene sequencing technology and is attracting growing attention. Compared with next-generation sequencing (NGS) data, SMRT sequencing data can characterize more chromosomal structural variations (SVs) and therefore has greater advantages for variant calling. However, the high noise of SMRT sequencing data makes it difficult for existing tools to detect structural variation accurately. This thesis therefore proposes a method for calling deletions from SMRT sequencing data, consisting of four parts:

(1) Calibration of VCF files. A VCF (Variant Call Format) file records the structural variations of the corresponding sample. Because the recorded breakpoints of these variations are inaccurate, the genomic images generated in subsequent experiments contained many false positives. To address this, features are extracted from the structural variations recorded in the VCF file and a machine learning method is used to correct them, which helps to generate accurate genome images and to train a credible deep learning model.

(2) Generation of images from single-molecule data. The single-molecule data is studied so that its information is retained completely when mapped into images. Images can connect otherwise isolated sequences and reflect the positional relationships between them; they are also the input to the deep learning method. Studying the coloring of genomic images and the rules by which sequences are mapped into high-quality images therefore both captures the spatial and positional characteristics of the sequences and paves the way for the subsequent deep learning experiments.

(3) Image augmentation. Because sequencing data is scarce, not enough images can be obtained. To increase the amount of data and balance the input samples, a generative adversarial network (GAN) is used to augment the positive-sample images. Compared with generating additional deletion images from simulated data, using a GAN increases the randomness of the deletion images, which yields a more reliable convolutional neural network (CNN) model for the subsequent identification of unknown genome images.

(4) Integrated calling method for deletions. To improve the accuracy of deletion calling, the results of Sniffles, NextSV, SVIM, Picky and SMRT-SV are integrated as the candidate set. After images are generated for the candidates, the trained CNN model classifies them to produce the final result. The method outperforms the other tools: it achieves a higher F1-score on both real and simulated data, at both low and high coverage depth, and for both short and long deletions.
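The candidate-set integration and F1-score evaluation described in part (4) can be illustrated with a minimal Python sketch. The abstract does not state how calls from the different tools are merged or how a predicted deletion is matched to a true one, so the 50% reciprocal-overlap rule used here is an assumption, and the names `reciprocal_overlap`, `merge_candidates` and `f1_score` are hypothetical. The CNN filtering step that the thesis applies to the merged candidates is omitted.

```python
from typing import List, Tuple

# A deletion call as (chromosome, start, end), with end exclusive.
Deletion = Tuple[str, int, int]


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def merge_candidates(call_sets: List[List[Deletion]],
                     min_ro: float = 0.5) -> List[Deletion]:
    """Union the call sets, keeping one representative per group of calls
    that reciprocally overlap by at least min_ro (assumed merge rule)."""
    merged: List[Deletion] = []
    for calls in call_sets:
        for call in calls:
            if not any(reciprocal_overlap(call, kept) >= min_ro for kept in merged):
                merged.append(call)
    return sorted(merged)


def f1_score(predicted: List[Deletion], truth: List[Deletion],
             min_ro: float = 0.5) -> float:
    """F1 = 2PR / (P + R), matching calls to truth by reciprocal overlap."""
    if not predicted or not truth:
        return 0.0
    matched_pred = sum(
        any(reciprocal_overlap(p, t) >= min_ro for t in truth) for p in predicted
    )
    matched_truth = sum(
        any(reciprocal_overlap(t, p) >= min_ro for p in predicted) for t in truth
    )
    precision = matched_pred / len(predicted)
    recall = matched_truth / len(truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example calls standing in for the output of two of the tools.
sniffles_calls = [("chr1", 1000, 2000), ("chr1", 5000, 5600)]
svim_calls = [("chr1", 1050, 2020), ("chr2", 300, 900)]
candidates = merge_candidates([sniffles_calls, svim_calls])
```

In this sketch the first caller's version of a duplicated deletion is kept as the representative; a real pipeline might instead average breakpoints or prefer the tool with the most accurate breakpoints.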
Keywords/Search Tags: Real-time single-molecule sequencing, structural variation detection, machine learning, sequence visualization, convolutional neural network, generative adversarial network