
Research On Algorithms Of Genome-wide Association Study For Complex Diseases Based On Maximal Information Coefficient

Posted on: 2016-05-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H M Liu
GTID: 1224330482474737
Subject: Biomedical engineering
Abstract/Summary:
Following the completion of the Human Genome Project (HGP), the genome-wide association study (GWAS) was proposed to sequence and analyze the whole genome for complex traits. The purposes of GWAS are to find gene variants and single nucleotide polymorphisms (SNPs), to identify susceptibility regions in the genome and the risk genes of a disease, and to search for biomarkers for early diagnosis, personalized therapy, new drug development and targeted prevention. These studies are classified as association studies carried out on the whole genome, with repeated validation across multiple centers and large samples, to gain insight into the genes affecting the occurrence, development and treatment of a disease. So far, many algorithms and tools have been developed for GWAS, and they show promise. They have been demonstrated to have computational and statistical merits. One study, however, indicated that the performance of those methods is uncertain on general data sets. Furthermore, the huge size and discrete nature of GWAS data make the existing methods unsatisfactory in terms of efficiency, statistical power and false positive rate. Thus, developing new and effective algorithms for GWAS is a challenge for bioinformatics researchers. To this end, the following research was carried out in this dissertation:

1) Analyzed the maximal information coefficient (MIC). MIC is a novel statistical measure that satisfies generality and equitability in correlation analysis. It greatly outperforms Pearson’s coefficient, Spearman’s coefficient, mutual information, CorGC and maximal correlation. In this study, the principle of MIC was discussed, a recurrence equation in MIC was proved mathematically, and the implementation of MIC was described in detail. The shortcomings of MIC when introduced into GWAS were analyzed, and finally the feasibility of applying MIC to GWAS was discussed.

2) Presented a new algorithm, MICSNPs, for searching SNP-disease associations.
The new algorithm uses a Monte Carlo (MC) permutation test to map MIC values onto P-values, which eliminates the effects of MIC fluctuation. To save run time, a sliding window-based binary search was designed in MICSNPs, whose run time was 0.58% of that of a sequential search in our experiments. To reach a compromise among statistical power, false positive rate and run time, the relation between the MC sampling count and these three measures was studied. The results showed that the best MC sampling count is 2-4 times the number of biomarkers in a genotype data set, and that it is independent of the sample size of the data set. Tests on a real data set and simulated data sets showed that, with a resampling count of 4 times the number of biomarkers, MICSNPs was computationally and statistically feasible and effective. The experiments indicated that MICSNPs outperforms the existing methods.

3) Presented another new algorithm, mBoMIC, for searching SNP-disease associations. First, a modified Bagging (mBagging) algorithm was proposed. Whereas the traditional Bagging algorithm uses the same Bootstrap sampling count for the bagged and out-of-bag data sets, mBagging uses different counts and makes the number of bagged sets smaller than the number of out-of-bag sets. The smaller number of bagged sets decreases the computational complexity of Bagging while keeping an appropriate statistical power. The larger number of out-of-bag sets improves the power further. Thus, mBagging reduces the time cost while improving statistical power. Moreover, the smaller number of bagged sets reduces the over-fitting of traditional Bagging and gives mBagging a lower false positive rate. The main contribution of the mBagging algorithm lies in simultaneously improving three competing measures: statistical power, false positive rate and time cost.
By combining mBagging and MIC, a modified Bagging of maximal information coefficient (mBoMIC) was formed to search for SNP-disease associations. It integrates the advantages of Bagging and MIC, and may overcome MIC’s weaknesses of low statistical power and fluctuating MIC values. Using 500 groups of genotype data as the experimental objects, and 20 and 400 as the counts of bagged and out-of-bag sets respectively, mBagging reduced the time cost by 80.3%, increased the statistical power by 15.2% and decreased the false positive rate by 31.3%, compared with a Bagging with a resampling count of 400. The results on simulated and real data suggested that mBoMIC has better statistical power than other existing methods, which indicates the feasibility of mBoMIC in biomarker selection.

4) Constructed an algorithm for identifying differentially expressed genes/microRNAs based on MIC. GWAS explores not only genotype data sets but also gene/microRNA expression profiles across the whole genome. A new method was constructed to mine the genes/microRNAs associated with disease risk from microarray gene/microRNA expression data sets. Using the new method, a gene expression data set of atrial fibrillation versus control and a microRNA expression profile of valvular heart disease versus control were analyzed. In total, 41 differentially expressed genes were identified by our method, of which 14 are new findings. The analysis of signaling pathways and enrichment showed that these new genes are strongly associated with atrial fibrillation. Furthermore, 2 strongly differentially expressed microRNAs were screened in this work, of which hsa-miR-221* is a new finding compared with other related work.

In this study, MIC was successfully introduced into GWAS, and two new algorithms, MICSNPs and mBoMIC, were proposed for the search of SNP-disease associations by overcoming the defects of MIC.
Moreover, MIC was studied for the analysis of gene/microRNA expression profiles. These algorithms could serve as novel computational methods for searching and identifying the biomarkers associated with complex diseases.
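
The Monte Carlo permutation test described in item 2) can be illustrated with a minimal Python sketch. It is not the MICSNPs implementation: the association score here is the absolute Pearson correlation, used only as a stand-in for MIC, and the lookup is a plain binary search from the standard library rather than the sliding window-based variant. The names `abs_pearson`, `build_null` and `empirical_p` are hypothetical and chosen for illustration.

```python
import bisect
import random


def abs_pearson(x, y):
    """Stand-in association score (NOT MIC): absolute Pearson correlation."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no association signal
    return abs(sxy) / (sxx * syy) ** 0.5


def build_null(genotype, phenotype, score, n_perm, rng):
    """Score the SNP against shuffled phenotype labels; return sorted scores."""
    labels = list(phenotype)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # shuffling breaks any true SNP-disease link
        null.append(score(genotype, labels))
    null.sort()
    return null


def empirical_p(observed, sorted_null):
    """P-value = (1 + #null scores >= observed) / (1 + n_perm)."""
    # Binary search locates the first null score that is >= observed.
    n_ge = len(sorted_null) - bisect.bisect_left(sorted_null, observed)
    return (1 + n_ge) / (1 + len(sorted_null))


# Toy data (assumed): genotype coded 0/1/2, phenotype coded 0 = control, 1 = case.
rng = random.Random(0)
genotype = [0] * 10 + [2] * 10
phenotype = [0] * 10 + [1] * 10
observed = abs_pearson(genotype, phenotype)
null = build_null(genotype, phenotype, abs_pearson, 200, rng)
p_value = empirical_p(observed, null)
```

Sorting the null scores once makes each later lookup logarithmic in the number of permutations, which is the general reason a binary search is cheaper than scanning the null scores sequentially. How the null distribution is shared across SNPs, and how the sampling count is tied to the number of biomarkers, is not specified in this abstract and is left out of the sketch.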
Keywords/Search Tags: genome-wide association study (GWAS), maximal information coefficient (MIC), biomarker, computational complexity, statistical performance