| Genomic variants are associated with human diseases and traits, so the sound analysis of variants is of great significance to the study of pathogenesis. Copy number variation (CNV) is a major form of genomic variation and accounts for a considerable proportion of the variation in cancer genomes. CNVs range in length from several kilobases (kb) to several megabases (Mb) or even longer. Numerous studies have shown that genomic CNV regions contain cancer driver genes or suppressor genes. The accurate detection of CNVs in tumor samples can therefore provide critical information for the analysis and diagnosis of cancer. At present, precision cancer treatment based on the analysis of genetic variation has become routine. In practice, for a cancer patient, doctors typically extract tumor tissue (which may contain a proportion of normal tissue) and sequence it, which provides the data basis for CNV analysis. Considering the cost of sequencing, paracancerous tissue (i.e., normal cells of the patient) may not always be collected. It is therefore of great significance to study CNV detection methods for single tumor sequencing samples. Next-generation sequencing (NGS) provides high-resolution sequencing data, which is naturally well suited to the detection of CNVs. However, the large volume and complexity of NGS data, as well as the complexity of CNV structures, pose great challenges to the accurate detection of CNVs. For NGS data, the read depth (RD) based approach is currently the most widely used CNV detection method; it identifies CNVs from abnormal read counts (i.e., the number of successful alignments at each position) obtained by aligning NGS data to a reference sequence. This thesis studies an RD-based CNV detection method for single tumor samples using NGS data. The main work comprises the following two aspects: (1) A CNV detection method, CNV_OCSVM, based on the One-Class Support Vector Machine (One-Class SVM), is designed, which abstracts the CNV 
detection problem into a one-class classification problem. The method first aligns NGS data to a reference sequence to obtain RD information as a sample set. It then randomly selects RD values of a certain length, together with their position information, from the sample set as a training set, trains a two-dimensional One-Class SVM decision model, and uses the model to classify all sample points and identify the abnormal ones. The abnormal points found in each round are excluded from the next round of sampling, so that repeated iterations of this process "peel" the abnormal points from the sample set layer by layer; finally, adjacent abnormal points are merged to obtain the final CNV regions. Because far fewer sample points lie in CNVs than in normal regions, a random sample of the whole sample set usually reflects the spatial features of the normal sample points, and sampling also solves the problem that the sample set is too large to train a One-Class SVM decision model effectively. The "hard-margin" support vector data description (SVDD) model without slack variables is used in training, which keeps the repeated iterations of the algorithm efficient and avoids having to select the penalty parameter of the original SVDD model. (2) The CNV detection performance of CNV_OCSVM is verified through experiments on simulated data and on real data. The simulation results show that the method achieves the highest recall on most simulated datasets while maintaining high accuracy. This indicates that it can detect more variant regions whose RD values differ only slightly from those of normal regions, and that it better tolerates random disturbance in the sequencing data at boundary positions, effectively reducing the misclassification of normal sample points in disturbed areas. In the real-data experiment, the number of CNV records detected by this method is small, but the degree of overlap with the 
detection results of the comparison methods is high, indicating that this method has certain advantages in accuracy over the other methods and that the CNV regions it detects are of high quality and reliability. |
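
The control flow of the iterative procedure described in aspect (1) can be illustrated with a minimal sketch: sample a training set, fit a detector, classify the remaining points, exclude the detected abnormal points from the next round, and finally merge adjacent abnormal positions into regions. All names here (`peel_abnormal_points`, `merge_adjacent`, `median_deviation_detector`, `rounds`, `sample_size`, `tolerance`) are hypothetical and not taken from the thesis. The detector is a deliberately simple median-based stand-in so that the example runs with the standard library alone; it is not the hard-margin SVDD / One-Class SVM model used by CNV_OCSVM. Sampling individual points at random, rather than a contiguous segment, is also an assumption.

```python
import random
from statistics import median
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Point = Tuple[int, float]  # (position, read-depth value)
Detector = Callable[[Sequence[Point], Sequence[Point]], Set[int]]


def median_deviation_detector(train: Sequence[Point],
                              candidates: Sequence[Point],
                              tolerance: float = 20.0) -> Set[int]:
    """Stand-in detector (NOT One-Class SVM): flag candidates whose RD
    deviates from the median RD of the training sample by more than
    `tolerance`."""
    center = median(rd for _, rd in train)
    return {pos for pos, rd in candidates if abs(rd - center) > tolerance}


def peel_abnormal_points(points: Sequence[Point], detect: Detector,
                         rounds: int = 5, sample_size: int = 9,
                         seed: int = 0) -> Set[int]:
    """Repeatedly sample, detect, and exclude abnormal points."""
    rng = random.Random(seed)
    abnormal: Set[int] = set()
    for _ in range(rounds):
        # Points flagged in earlier rounds are excluded from sampling.
        pool = [p for p in points if p[0] not in abnormal]
        if len(pool) < sample_size:
            break
        train = rng.sample(pool, sample_size)
        # Fit on the sample, then classify every remaining point.
        abnormal |= detect(train, pool)
    return abnormal


def merge_adjacent(positions: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge consecutive positions into inclusive (start, end) regions."""
    regions: List[Tuple[int, int]] = []
    for pos in sorted(positions):
        if regions and pos == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], pos)
        else:
            regions.append((pos, pos))
    return regions


# Toy, noise-free RD profile: one gain (positions 10-12), one loss (30-32).
depths = [100.0] * 40
for i in (10, 11, 12):
    depths[i] = 150.0
for i in (30, 31, 32):
    depths[i] = 50.0

flagged = peel_abnormal_points(list(enumerate(depths)), median_deviation_detector)
regions = merge_adjacent(flagged)
print(regions)  # [(10, 12), (30, 32)]
```

Because the detector is passed in as a callable, a real One-Class SVM model could in principle be substituted for the stand-in without changing the peeling loop or the merging step.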