Detection Of Copy Number Variation Based On Isolation Forest And Total Variation

Posted on:2020-02-29

Degree:Master

Type:Thesis

Country:China

Candidate:J A Yu

Full Text:PDF

GTID:2370330602952150

Subject:Computer Science and Technology

Abstract/Summary:

As sequencing technology continues to evolve, sequencing is becoming ever faster. Since second-generation sequencing came into widespread use, human genome sequencing data has grown at an explosive rate, which has greatly advanced research in related fields. Accumulating this data, however, is only the foundation for extracting useful information from the genome. The ultimate goal is to study how the arrangement and combination of base pairs relates to individuals, and analyzing genomic data helps us understand how human genes work. Second-generation sequencing data consists of short fragments and is highly complex, which makes genomic data analysis more challenging. The human genome varies in many ways, and the length of a variant ranges from a single base to an entire chromosome. Accurately detecting the type and region of a variant is both the focus and the main difficulty of genomic data analysis, and many detection algorithms have been developed for the different variant types. Among these types, copy number variation has been shown to be closely related to cancer, so this thesis focuses on copy number variation detection.

Existing single-sample copy number detection algorithms achieve good results on samples with high coverage and high tumor purity. On samples with low coverage and low tumor purity, however, their detection accuracy, sensitivity, and F1 score are relatively low. This thesis therefore targets single-sample sequencing data with low coverage and low tumor purity.

To address the poor performance of existing copy number variation detection algorithms under these conditions, this thesis proposes CNV_IFTV, a copy number variation detection method based on isolation forest. The method uses the nonlinear mapping of the tree model and the advantages of ensemble learning to characterize the read depth (RD) information effectively. During modeling, the algorithm depends on the content and ordering of the RD values in the sample rather than on their absolute differences, which effectively addresses the data imbalance problem in copy number variation detection. In addition, a total variation denoising model brings the correlation between adjacent sliding-window positions into the model, which makes the anomaly score a more reliable measure of how abnormal an RD value is. Once the anomaly scores are obtained, a threshold is needed to decide whether a copy number variation occurs in each sliding window. In this thesis, the threshold is selected automatically by Otsu's method. Finally, experiments verify that CNV_IFTV detects copy number variation well on both simulated and real data.
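The three stages named in the abstract (isolation-forest scoring of RD values, total variation smoothing of the scores, and Otsu thresholding) can be sketched roughly as follows. This is a minimal illustration only, not the thesis implementation: the abstract gives no implementation details, so every function name, the parameters `n_trees`, `sample_size`, `lam`, and `bins`, the dual projected-gradient solver used for the total variation step, and the synthetic read depths are all assumptions made for the example.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    # 0.5772156649 approximates the Euler-Mascheroni constant.
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(sample, depth, limit, rng):
    """Split 1-D values at random cut points; a leaf stores only its size."""
    if depth >= limit or len(sample) <= 1 or min(sample) == max(sample):
        return len(sample)
    cut = rng.uniform(min(sample), max(sample))
    left = [v for v in sample if v < cut]
    right = [v for v in sample if v >= cut]
    return (cut, build_tree(left, depth + 1, limit, rng),
            build_tree(right, depth + 1, limit, rng))


def path_length(x, node):
    """Number of splits needed to reach a leaf, plus the leaf-size correction."""
    depth = 0
    while isinstance(node, tuple):
        cut, left, right = node
        node = left if x < cut else right
        depth += 1
    return depth + c_factor(node)


def isolation_scores(values, n_trees=100, sample_size=256, seed=0):
    """Anomaly score in (0, 1] for each value; higher means more anomalous."""
    rng = random.Random(seed)
    psi = min(sample_size, len(values))
    limit = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(values, psi), 0, limit, rng)
             for _ in range(n_trees)]
    norm = n_trees * c_factor(psi)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / norm)
            for x in values]


def tv_denoise(y, lam, n_iter=300):
    """Minimize 0.5*sum((x-y)^2) + lam*sum(|x[i+1]-x[i]|) by projected
    gradient ascent on the dual variable z (one entry per adjacent pair)."""
    n = len(y)
    z = [0.0] * (n - 1)
    x = list(y)
    for _ in range(n_iter):
        z = [max(-lam, min(lam, z[i] + 0.25 * (x[i + 1] - x[i])))
             for i in range(n - 1)]
        x = [y[j] - (z[j - 1] if j > 0 else 0.0) + (z[j] if j < n - 1 else 0.0)
             for j in range(n)]
    return x


def otsu_threshold(scores, bins=64):
    """Histogram cut that maximizes the between-class variance."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / bins or 1.0
    hist = [0] * bins
    for s in scores:
        hist[min(bins - 1, int((s - lo) / width))] += 1
    total = len(scores)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best, best_var, w0, sum0 = 0, -1.0, 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += i * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        var = w0 * w1 * (sum0 / w0 - (sum_all - sum0) / w1) ** 2
        if var > best_var:
            best, best_var = i, var
    return lo + (best + 1) * width  # upper edge of the last low-score bin


def detect_cnv_windows(rd, lam=0.1):
    """Score each window, smooth the scores, then threshold them."""
    scores = isolation_scores(rd)
    smooth = tv_denoise(scores, lam)
    thr = otsu_threshold(smooth)
    return scores, smooth, thr, [s >= thr for s in smooth]


# Synthetic read depths (assumed): 300 windows near 100, windows 120-129 near 200.
rng = random.Random(1)
rd = [rng.gauss(100, 3) for _ in range(300)]
for i in range(120, 130):
    rd[i] = rng.gauss(200, 3)
scores, smooth, thr, calls = detect_cnv_windows(rd)
print("flagged windows:", [i for i, c in enumerate(calls) if c])
```

In this sketch the smoothing step is what ties neighboring windows together: an isolated high score from a single ordinary window tends to be flattened toward its neighbors, while a contiguous run of high scores should largely survive and remain above the Otsu cut. The value of `lam` controls that trade-off and would need tuning on real data.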

Keywords/Search Tags:

second generation sequencing, copy number variation, isolation forest, total variation

Related items

1	Comprehensive Detection Method Of Copy Number Variation And Its Boundary For Next-generation Sequencing Data
2	Detection Of Copy Number Variation Based On Statistical Examination
3	Detection Of Tumor Copy Number Variation And Inference Of Subclonal Populations Based On Next-generation Sequencing Data
4	Detection Algorithms Of Genomic Copy Number Variation Based On Low Coverage Sequencing Data
5	Detection Of Copy Number Variation Based On One-class Support Vector Machine
6	Algorithms For Genome Structural Variation Prediction
7	Establishment And Applicability Verification Of A Novel Technology For Single-cell Copy Number Variation Sequencing
8	Studies Of New Techniques For Nucleic Acid Test Using Next Generation Sequencing Platforms
9	Detection Of Copy Number Variants Based On Genome Sequencing Data
10	Research On Detection Of DNA Copy Number Variation Based On Read Depth Method