
Study On Statistical Methods For Cancer Genome Sequencing Data

Posted on: 2013-07-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Hua
GTID: 1220330377451867
Subject: Probability theory and mathematical statistics
Abstract/Summary:
Over the past several years, with the development of next-generation sequencing (NGS) technology, several platforms based on cyclic-array sequencing have appeared and come into wide use. Individual investigators can now pursue projects that were once accessible only to major genome centers. NGS has been widely applied in biological research and has led to significant scientific achievements. Compared with traditional sequencing technology, NGS dramatically reduces sequencing cost and significantly improves sequencing efficiency. It still has disadvantages, however, such as shorter read lengths and higher sequencing error rates.

It is now easier than ever to obtain large amounts of sequencing data. As NGS experiments continue to generate huge amounts of cancer genome sequencing data, analyzing these data poses substantial challenges. These include how to perform efficient statistical analysis and how to obtain accurate statistical inference and effective statistical tests. To identify the landscape of somatic mutations in lung cancer from whole-exome sequencing data, we carried out the following work: inferring the genotype of a specified locus in a given sample from sequencing data; estimating the mutation rate and the loss of heterozygosity (LOH) rate at a specified locus; testing a specified locus for somatic mutation; identifying driver genes whose somatic mutations may play important roles in maintaining the cancer phenotype; and exploring the interaction effects between driver genes.

There are two main difficulties in inferring genotypes from sequencing data: sequencing error and the mixture of tumor and normal DNA in a sample.
However, existing software for genotype inference generally uses Bayesian discriminant analysis based on the binomial distribution and does not take the sample mixture into account, which may lead to underestimating the mutation rate and missing real variant loci. Our approach introduces several parameters, namely the mutation rate of each locus, the sequencing error rate of each locus, and the mixing ratio of each tumor sample, and develops a binomial likelihood model for each locus or for each sample. The maximum likelihood estimates of the parameters are then obtained with the expectation-maximization (EM) algorithm, and genotypes are inferred from posterior probabilities. Simulation results show that our method is more accurate than the traditional Bayesian method and that the EM algorithm runs faster than other estimation methods. They also support the necessity and rationale of estimating the composition rate parameter. To demonstrate the utility of the new method, we applied it to whole-exome sequencing data from 249 lung tumor samples. When the composition rate is taken into account, our method not only finds most of the somatic mutations detected by existing methods but also identifies a large number of novel variants that may be involved in lung tumorigenesis.

Our approach also yields the maximum likelihood estimate of the mutation rate at the same time as it infers the genotype at each locus. A likelihood ratio test on the mutation rate parameter is then performed to test whether a locus is a somatic single nucleotide variant (SNV). Simulation results reveal several factors that influence the power of the test, and they also support the soundness of the test and the effectiveness of our iterative algorithm based on maximum likelihood estimation. On real data, we found some new loci that might be somatic SNVs.
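As a rough illustration of the EM step described above, the following Python sketch fits genotype frequencies at a single locus from variant read counts and then calls genotypes by posterior probability. It is a simplified stand-in rather than the dissertation's actual model: the tumor purity (mixing ratio) of each sample and the sequencing error rate are assumed known instead of being estimated, contaminating normal cells are assumed to be homozygous reference, and every function and parameter name is hypothetical.

```python
import math

def binom_logpmf(k, n, p):
    """Log of the binomial probability of k variant reads out of n."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def variant_read_prob(g, purity, err):
    """Chance that a read shows the variant allele when the tumor carries
    g variant copies (0, 1 or 2) and the sample is a tumor/normal mixture.
    Assumption: normal cells are homozygous reference."""
    frac = purity * g / 2.0
    return frac * (1 - err) + (1 - frac) * err

def posteriors(k, n, purity, err, prior):
    """Posterior probability of each genotype for one sample."""
    logw = [math.log(prior[g]) + binom_logpmf(k, n, variant_read_prob(g, purity, err))
            for g in range(3)]
    top = max(logw)
    w = [math.exp(x - top) for x in logw]
    total = sum(w)
    return [x / total for x in w]

def em_genotype_freqs(samples, err=0.01, n_iter=200, tol=1e-8):
    """EM for the genotype frequencies at one locus.
    samples: list of (variant_reads, depth, purity) tuples.
    Assumption: err and each purity are known, not estimated."""
    prior = [1.0 / 3] * 3
    for _ in range(n_iter):
        # E-step: genotype responsibilities for every sample
        resp = [posteriors(k, n, a, err, prior) for k, n, a in samples]
        # M-step: average responsibilities, floored to keep logs finite
        new = [max(sum(r[g] for r in resp) / len(samples), 1e-12) for g in range(3)]
        total = sum(new)
        new = [x / total for x in new]
        converged = max(abs(x - y) for x, y in zip(new, prior)) < tol
        prior = new
        if converged:
            break
    return prior

if __name__ == "__main__":
    # Toy data: 8 samples with no variant reads, 2 with 24 of 100 reads, purity 0.5
    toy = [(0, 100, 0.5)] * 8 + [(24, 100, 0.5)] * 2
    freqs = em_genotype_freqs(toy)
    post = posteriors(24, 100, 0.5, 0.01, freqs)
    print("genotype frequencies:", freqs)
    print("called genotype:", max(range(3), key=lambda g: post[g]))
```

In this sketch `1 - freqs[0]` plays the role of the per-locus mutation rate. Setting every purity to 1.0 gives a mixture-unaware model of the kind the text attributes to existing software, which makes it easy to compare calls with and without the composition rate.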
Similarly, we introduce the loss of heterozygosity rate as a per-locus parameter in our model and perform the corresponding maximum likelihood estimation and likelihood ratio test. However, more research is needed before the LOH rate can be taken into account in real data analysis.

To find driver genes that may drive the cancer phenotype, we first classify somatic mutations into categories based on their functional consequences and the mutated base pair, then count the mutations of each type in each gene for each tumor sample. We take the mutation type, the gene length, and the individual background mutation rate into account and build a likelihood model based on the Poisson distribution. We introduce an offset coefficient for the mixture χ² distribution, which is the distribution, under the null hypothesis, of the multi-parameter likelihood ratio test statistic when parameters lie on the boundary. Simulation results show that our method has higher power than some existing methods based on the Bernoulli distribution. On real data, our method finds more driver genes, which can also be verified biologically. In addition, our method is flexible and can be extended to test driver pathways or gene sets.

Studying the interaction between genes has been very challenging in recent years. We introduce a Monte Carlo simulation procedure to study the interaction between two genes. Finding interaction effects of third or higher order, however, is considerably more difficult; the multifactor dimensionality reduction method might be introduced to handle this problem. Simulation results show that our method performs better than a permutation test by excluding the confounding factor of gene length.
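The abstract does not spell out the Monte Carlo procedure, so the following Python sketch only illustrates the general idea of a simulation-based test for interaction between two genes that adjusts for gene length. Everything in it is an assumption made for illustration: the null model (each gene is mutated independently in a sample with probability 1 - exp(-rate × length)), the co-occurrence count as the test statistic, and all function and parameter names.

```python
import math
import random

def mutation_prob(rate, length):
    """Probability of at least one mutation in a gene of the given length,
    assuming a Poisson number of mutations with mean rate * length."""
    return 1.0 - math.exp(-rate * length)

def cooccurrence_pvalue(mut_a, mut_b, rates, len_a, len_b, n_sim=2000, seed=0):
    """One-sided Monte Carlo p-value for co-occurrence of mutations in genes A and B.
    mut_a, mut_b: 0/1 mutation indicators per sample (same order and length as rates).
    rates: hypothetical per-sample background mutation rates per base.
    Null hypothesis: the two genes mutate independently given rate and length."""
    rng = random.Random(seed)
    observed = sum(1 for a, b in zip(mut_a, mut_b) if a and b)
    p_a = [mutation_prob(r, len_a) for r in rates]
    p_b = [mutation_prob(r, len_b) for r in rates]
    hits = 0
    for _ in range(n_sim):
        simulated = 0
        for pa, pb in zip(p_a, p_b):
            hit_a = rng.random() < pa  # simulate gene A in this sample
            hit_b = rng.random() < pb  # simulate gene B independently
            if hit_a and hit_b:
                simulated += 1
        if simulated >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (hits + 1) / (n_sim + 1)

if __name__ == "__main__":
    rates = [1e-6] * 50
    gene_a = [1] * 10 + [0] * 40
    gene_b = [1] * 10 + [0] * 40  # mutated in exactly the same samples as gene A
    print(cooccurrence_pvalue(gene_a, gene_b, rates, 100_000, 100_000))
```

Because gene length enters the null model through `mutation_prob`, a long gene that is frequently mutated by chance does not automatically appear to interact with other genes, which is the kind of confounding the text says a plain permutation test fails to exclude. Counting simulations with `simulated <= observed` instead would give the analogous test for mutual exclusivity.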
Results on real data show that interaction analysis can help us find the key genes in some cancer pathways.

The data studied in this work include whole-exome sequencing data from 249 lung cancer patients from The Cancer Genome Atlas (TCGA) and the corresponding count data of the number of mutations in each gene and each sample.
Keywords/Search Tags: Next-generation Gene Sequencing, Genotype Inferring, Somatic Mutation, Loss of Heterozygosity, Tumor and Normal Composition, Driver Gene, Gene Interaction, EM Method, Likelihood Ratio Test, Monte Carlo Simulation