
Detection Of Copy Number Variation Based On Isolation Forest And Total Variation

Posted on: 2020-02-29 | Degree: Master | Type: Thesis
Country: China | Candidate: J A Yu | Full Text: PDF
GTID: 2370330602952150 | Subject: Computer Science and Technology
Abstract/Summary:
As sequencing technology continues to evolve, sequencing is becoming faster. Since second-generation sequencing came into widespread use, human genome sequencing data has grown at an explosive rate, which has greatly promoted research in related fields. Accumulating this data, however, is only the basis for extracting useful information from the genome. The ultimate goal is to study how the arrangement and combination of these base pairs relate to individuals, and analyzing genomic data helps us understand how human genes work. Second-generation sequencing data consists of short fragments and is highly complex, which makes genetic data analysis more challenging. The variation patterns of the human genome are diverse, and the length of a variant ranges from a single base to an entire chromosome. Accurately detecting the type and region of variation is both the focus and the main difficulty of genomic data analysis, and many algorithms have been developed to detect the different kinds of variation.

Among the variants of the human genome, copy number variation (CNV) has been shown to be closely related to cancer, so this thesis focuses on copy number variation detection. Existing single-sample copy number detection algorithms achieve good results when the sample has high coverage and high tumor purity. When the sample has low coverage and low tumor purity, however, their accuracy, sensitivity and F1 score are relatively low. This thesis therefore targets single-sample sequencing data with low coverage and low tumor purity. To address the poor results of existing algorithms under these conditions, it proposes CNV_IFTV, a copy number variation detection method based on isolation forest, which uses the nonlinear mapping of the tree model and the advantages of ensemble learning to characterize the read depth (RD) information effectively. During modeling, the algorithm depends only on the content and ordering of the RD values in the sample, not on their absolute differences, which effectively addresses the data imbalance problem in copy number variation detection.

In addition, a total variation denoising model adds the correlation between adjacent sliding-window positions to the model, so that the anomaly score is a more reliable measure of how abnormal an RD value is. Once the anomaly scores are obtained, a threshold is needed to decide whether a copy number variation occurs in each sliding window. In this thesis the threshold is selected automatically by Otsu's method. Finally, experiments verify that the proposed CNV_IFTV algorithm detects copy number variation well on both simulated and real data.
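The three stages described in the abstract (isolation-forest scoring of RD values, total variation smoothing across adjacent windows, and automatic thresholding with Otsu's method) can be illustrated with a minimal sketch. This is not the thesis implementation. All function names, parameter values and the synthetic RD profile are hypothetical. The sketch also makes several assumptions: a one-dimensional forest over scalar RD values, smoothing applied to the anomaly scores rather than to the RD signal, a simple projected-gradient TV solver, and Otsu's criterion evaluated on the sorted scores instead of a histogram.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length of an unsuccessful search in a BST of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(sample, rng, depth, max_depth):
    """Split 1-D values at random cut points; a leaf stores only its size."""
    if depth >= max_depth or len(sample) <= 1 or min(sample) == max(sample):
        return len(sample)
    split = rng.uniform(min(sample), max(sample))
    left = [v for v in sample if v < split]
    right = [v for v in sample if v >= split]
    return (split, build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))

def path_length(x, node):
    """Depth at which x reaches a leaf, plus the leaf-size correction."""
    depth = 0
    while isinstance(node, tuple):
        split, left, right = node
        node = left if x < split else right
        depth += 1
    return depth + c_factor(node)

def anomaly_scores(values, n_trees=100, sample_size=64, seed=0):
    """Isolation-forest score in (0, 1]; higher means easier to isolate."""
    rng = random.Random(seed)
    psi = min(sample_size, len(values))  # assumes at least two values
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(values, psi), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
            for x in values]

def tv_denoise(y, lam, n_iter=500):
    """Minimize 0.5*||u - y||^2 + lam*sum|u[i+1] - u[i]| by projected
    gradient steps on the dual variable z (assumed solver, step size 1/4)."""
    n = len(y)
    z = [0.0] * (n - 1)

    def primal(z):
        # u = y - D^T z, where (D u)[i] = u[i+1] - u[i]
        return [y[j] - (z[j - 1] if j > 0 else 0.0) + (z[j] if j < n - 1 else 0.0)
                for j in range(n)]

    for _ in range(n_iter):
        u = primal(z)
        z = [max(-lam, min(lam, z[i] + 0.25 * (u[i + 1] - u[i])))
             for i in range(n - 1)]
    return primal(z)

def otsu_threshold(scores):
    """Cut that maximizes the between-class variance of the two groups."""
    s = sorted(scores)
    n, total = len(s), sum(s)
    best_var, best_cut, left_sum = -1.0, s[0], 0.0
    for k in range(1, n):
        left_sum += s[k - 1]
        if s[k - 1] == s[k]:
            continue  # cannot cut between equal values
        mean0, mean1 = left_sum / k, (total - left_sum) / (n - k)
        between = (k / n) * ((n - k) / n) * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_cut = between, 0.5 * (s[k - 1] + s[k])
    return best_cut

def detect_cnv(rd, lam=0.05, seed=0):
    """Flag windows whose smoothed anomaly score exceeds the Otsu cut."""
    smooth = tv_denoise(anomaly_scores(rd, seed=seed), lam)
    cut = otsu_threshold(smooth)
    return [v > cut for v in smooth]

# Synthetic RD profile (hypothetical): baseline near 100, one amplified segment.
rd = [100.0 + ((7 * i) % 11 - 5) for i in range(120)]
for i in range(50, 70):
    rd[i] += 80.0
flags = detect_cnv(rd)
print([i for i, f in enumerate(flags) if f])  # indices of flagged windows
```

Because the trees split on randomly chosen cut points, a window's score depends on how easily its RD value is separated from the rest of the sample, which is why rare values stand out even when they are heavily outnumbered. The smoothing weight `lam` is a free parameter in this sketch and would need tuning on real data.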
Keywords/Search Tags:second generation sequencing, copy number variation, isolation forest, total variation