| Detecting pathological characteristics such as tumor purity, ploidy, and copy number variation (CNV) plays an important role in identifying pathogenic genes and related treatments. With advances in experimental technology, next-generation sequencing (NGS) has been widely used in cancer genomics research because of its high throughput, high resolution, and low cost. The reads produced by NGS are relatively short, so the volume of data is very large. To analyze this type of data effectively, three problems need to be solved: (1) how to extract features from the sequencing data; (2) which method to use to analyze the sequencing data; and (3) how to properly quantify the relationship among tumor purity, genome aneuploidy, and copy number variation, which are strongly correlated. Existing algorithms cannot achieve satisfactory results when sequencing coverage and tumor purity are low. In addition, many copy number variation detection algorithms rely mainly on the read-depth signal of each window for anomaly analysis and do not effectively incorporate other information, so they cannot fully capture the characteristics of the window. Building on previous studies, this paper designs two algorithms that address these problems by exploiting the characteristics of next-generation sequencing data. The main research contents and results are as follows: 1. This paper proposes CNV_LGB, a method for detecting copy number variation from short-read sequencing data. It extracts window features and introduces the machine learning model LightGBM to classify abnormal windows. Specifically, CNV_LGB is based on the read-depth strategy. First, CNV_LGB performs general preprocessing on the sequencing data. Second, CNV_LGB extracts multiple features for each window and uses an existing detection model to obtain
some of the more reliable variant and normal regions, and then adds these regions to the data set as labels. Finally, the supervised machine learning model LightGBM is used to classify abnormal windows, and the abnormal windows are used to determine the copy number variation regions. CNV_LGB has two main advantages: 1) transforming an unsupervised anomaly detection method into a supervised imbalanced classification method helps to overcome the influence of abnormal data on the results of the algorithm; 2) extracting multiple features from the sequencing data captures the characteristics of each window along multiple dimensions, which helps to ensure the accuracy of the classification results. 2. A detection algorithm, Turp Aplo, is proposed for tumor purity and average ploidy. Specifically, this method first locates copy number deletion regions and then determines the specific deletion type by comparing differences in the read-depth signal. Finally, the expected read depth, the observed read depth, tumor purity, and average ploidy are related to one another, and the purity and average ploidy of the tumor sample are calculated iteratively using the characteristics of loss of heterozygosity. The advantage of this algorithm is that copy number deletion regions are of only two types, homozygous deletion and heterozygous deletion, so a more concise model can be established, which improves detection efficiency. Comparative experiments show that the algorithm performs well on both simulated and real data. |
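
The relationship among read depth, tumor purity, and average ploidy in deletion regions can be illustrated with a minimal sketch. It is not the paper's exact model: it assumes the commonly used linear mixture in which normal cells carry two copies and the sample-wide depth scales with the purity-weighted average ploidy. The function names `depth_ratio` and `purity_from_deletion` are hypothetical, and the iterative ploidy update mentioned in the abstract is not reproduced here; the ploidy is taken as given.

```python
def depth_ratio(copy_number, purity, ploidy):
    """Expected ratio of a region's read depth to the sample-wide mean depth.

    Assumption (not taken from the paper): a linear mixture model in which
    normal cells have 2 copies and tumor cells have `copy_number` copies.
    """
    region_signal = purity * copy_number + 2.0 * (1.0 - purity)
    sample_average = purity * ploidy + 2.0 * (1.0 - purity)
    return region_signal / sample_average


def purity_from_deletion(ratio, copy_number, ploidy):
    """Invert depth_ratio for a deletion region (tumor copy number 0 or 1).

    Solving ratio = (p*c + 2*(1-p)) / (p*ploidy + 2*(1-p)) for p gives
    p = 2*(1 - ratio) / (ratio*(ploidy - 2) + 2 - c).
    """
    if copy_number not in (0, 1):
        raise ValueError("a deletion region has tumor copy number 0 or 1")
    denominator = ratio * (ploidy - 2.0) + 2.0 - copy_number
    if denominator == 0:
        raise ValueError("purity is not identifiable for these inputs")
    return 2.0 * (1.0 - ratio) / denominator


if __name__ == "__main__":
    # Illustrative values only: a single-copy deletion in a sample with
    # purity 0.6 and average tumor ploidy 2.4.
    r = depth_ratio(copy_number=1, purity=0.6, ploidy=2.4)
    print(round(r, 3))  # approximately 0.625
    print(round(purity_from_deletion(r, copy_number=1, ploidy=2.4), 3))
```

Because only two deletion states (tumor copy number 0 or 1) are considered, the inversion has a closed form under this assumed model, which is consistent with the abstract's point that restricting attention to deletion regions keeps the model concise.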