The Strategy Of Choosing Variants To Correct For Population Stratification By Principal Component Analysis

Posted on: 2016-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: L W Zhang
GTID: 2284330461493269
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Genome-wide association studies (GWAS) have been successful in identifying thousands of susceptibility variants associated with complex diseases or observable traits. Population stratification is an important issue in genome-wide association analysis: it can produce spurious associations if its confounding effect is not appropriately controlled. Hence, association studies need to consider population genetic structure and control for population stratification. Various methods have been proposed to solve this problem, such as genomic control (GC), principal component analysis (PCA) and mixed models.

PCA, proposed by Hotelling in 1933, is one of the most classic multivariate statistical analysis methods. It explores datasets with a large number of variables (dimensions) through dimension reduction and thereby extracts useful information effectively. In the GWAS era, PCA is widely used for detecting population substructure, correcting for population stratification in disease studies and drawing inferences about human history.

In this study, we use low-coverage and high-coverage whole-genome sequencing (WGS) data downloaded from the 1000 Genomes Project to investigate whether PCs based on different kinds of variant sites can recognize three continental groups (EUR, ASN and AFR), in particular EUR and ASN, which are genetically closer to each other. The aim is to provide researchers with strategies for choosing variants when correcting for population stratification in genome-wide association analysis. We also discuss the origin and evolutionary history of these populations in depth.

The main contents are as follows:
(1) We use the chromosome 1 data of the low-coverage WGS dataset released in August 2010 on the 1000 Genomes Project website.
We then obtain the shared variant sites by matching variants across the different ancestries and classify them into three kinds: common variants (CVs, MAF > 5%), low-frequency variants (LFVs, 1% ≤ MAF ≤ 5%) and rare variants (RVs, MAF < 1%). We construct PCs based on the different variant classes and their combinations and assess how well they separate the populations.
(2) We use the data from all chromosomes of the low-coverage WGS dataset to examine again how well PCA recognizes these populations. Each chromosome is preprocessed as described above. We then combine the shared variants of all 22 chromosomes and classify the whole-genome data of the three populations into CVs, LFVs and RVs. As before, we construct PCs based on the different variant classes and their combinations and assess how well they separate the populations.
(3) We also use the high-coverage WGS dataset released in June 2011 to examine how well PCA recognizes these populations. Because of its extremely high dimensionality and computational requirements, we use only five chromosomes: 1, 5, 10, 15 and 20. Data preprocessing and PC construction follow the same procedure as in the two analyses above.

The main results of the study are as follows:
(1) Results for chromosome 1 of the low-coverage WGS data: The top two PCs alone, based on either CVs or LFVs, separate EUR, ASN and AFR very well, with CVs performing slightly better than LFVs. RVs have limited classification ability, and their unsatisfactory performance is not improved even by extracting more PCs. In addition, we construct PCs based on combinations of variant classes (CVs+LFVs, CVs+RVs and CVs+LFVs+RVs). The three combinations recognize the different ancestries about as well as CVs alone and clearly better than LFVs alone.
We also use CVs to classify subpopulations within each continental group, because CVs perform best in separating the continental populations. The results show that CVs carry enough information on genetic substructure to recover the outline of the different subpopulations, especially for AFR.
(2) Results for the whole-genome low-coverage WGS data: The results are consistent with those above in all cases. Notably, in most cases population classification improves when the whole-genome low-coverage data are used rather than chromosome 1 alone.
(3) Results for the high-coverage WGS data: The results are consistent with the two analyses above. Each classified population is also more tightly clustered, especially EUR and ASN, which are more homogeneous than AFR. In addition, RVs perform better than in the low-coverage WGS data: they clearly separate AFR from non-AFR.
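
The variant-selection procedure described in the abstract (bin shared variants by MAF, then build PCs from each bin) can be illustrated with a minimal Python sketch. It is not the thesis's actual pipeline. It assumes a samples × variants genotype matrix coded as 0/1/2 alternate-allele counts, simple column centering before the SVD, and the exclusion of monomorphic sites from the RV class; the function names and the simulated toy data are hypothetical.

```python
import numpy as np

def minor_allele_frequency(genotypes):
    """MAF per variant from a samples x variants matrix of 0/1/2 allele counts."""
    alt_freq = genotypes.mean(axis=0) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)

def classify_by_maf(maf):
    """Boolean masks for the three classes, using the thresholds in the abstract."""
    return {
        "CV": maf > 0.05,
        "LFV": (maf >= 0.01) & (maf <= 0.05),
        # Assumption: monomorphic sites (MAF == 0) are dropped, not counted as rare.
        "RV": (maf > 0.0) & (maf < 0.01),
    }

def principal_components(genotypes, n_components=2):
    """Sample scores on the top PCs via SVD of the column-centered matrix."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data, purely illustrative: two hypothetical groups whose allele
# frequencies differ slightly at every simulated variant.
rng = np.random.default_rng(0)
n_per_group, n_variants = 100, 500
freq_a = rng.uniform(0.0, 0.5, n_variants)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, n_variants), 0.0, 1.0)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_group, n_variants)),
    rng.binomial(2, freq_b, size=(n_per_group, n_variants)),
]).astype(float)

maf = minor_allele_frequency(genotypes)
masks = classify_by_maf(maf)

# One set of PC scores per variant class (skipping classes with < 2 variants).
pcs_by_class = {
    name: principal_components(genotypes[:, mask])
    for name, mask in masks.items()
    if mask.sum() >= 2
}
```

Plotting the first two columns of each entry in `pcs_by_class`, colored by population label, gives the kind of per-class comparison the abstract describes; combinations such as CVs+LFVs correspond to taking the logical OR of the masks before calling `principal_components`.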
Keywords/Search Tags: 1000 Genomes Project, sequencing data, population stratification, rare variants, principal component analysis, Euclidean distance