| With the completion of the Human Genome Project, the functional analysis of large-scale genome data has become one of the most active research topics. In recent years, the discovery of a close relationship between genes and complex diseases has made it increasingly important to identify diagnostic genes related to disease phenotypes in large-scale genome data. Although techniques for mining diagnostic genes from genome data have progressed considerably, the genes discovered so far explain only a small portion of disease causes. As research has deepened, more and more researchers have found that complex diseases such as cancer, hypertension, and Alzheimer's disease are affected mainly by interactions among multiple diagnostic genes, whereas previous work has largely ignored such interactions. In addition, large-scale genome data are inherently diverse in type, high-dimensional, and fast-growing, so mining algorithms tend to have high computational complexity, and the mining results tend to have weak statistical significance and to be hard to interpret and reproduce. All of these factors pose great challenges for mining diagnostic genes efficiently and effectively over large-scale genome data. This dissertation focuses on techniques for mining diagnostic gene interactive patterns. Driven by practical requirements, it models the data as sequences or graphs and proposes three important diagnostic gene interactive patterns together with efficient and effective mining frameworks for them. The main contributions of the dissertation are as follows: (1) Finding susceptible and protective interaction patterns in large-scale genetic association studies. Various approaches have been developed for finding significant genetic interactions, and they can determine whether the discovered patterns are related to a disease phenotype; however, they cannot tell how these patterns relate to the disease, for
example, whether they lead to a disease or inhibit its occurrence. This dissertation therefore proposes new susceptible and protective interactive genotype patterns and a corresponding mining framework, which offer a better perspective for uncovering the underlying relationship between genetic variants and complex diseases. The framework first filters the whole genotype data to find the hot regions that are highly associated with the disease. It then uses a two-level (global and local) sample enumeration tree to discover the susceptible genotype patterns and the corresponding protective patterns. Several pruning rules are also designed to further reduce the search space and thus accelerate the pattern mining process. Finally, extensive experiments on a large number of real datasets verify the efficiency and effectiveness of the proposed framework. (2) ELM-based large-scale genetic association study via statistically significant patterns. Because they lack multiple hypothesis testing correction, existing methods tend to generate many false positive results. This dissertation therefore proposes a new statistically significant pattern that considers both the family-wise error rate (FWER) and the false discovery rate (FDR), along with a corresponding mining framework. The framework first uses a specially designed upper bound of the χ² test to speed up FWER-constrained statistically significant pattern mining in a row enumeration manner. A space-efficient grid index is then devised, which dramatically improves the efficiency of FDR-constrained pattern discovery by grouping together the massive number of patterns that share the same significance. In addition, an ELM classifier is constructed using the significant patterns as feature vectors. Extensive experiments on different real genotype datasets show that the proposed framework achieves much higher efficiency and effectiveness. (3) k-vertex connected component detection over large-scale genetic interaction networks. Finding components with high
connectivity is an important problem in the analysis of large-scale genetic interaction networks. In particular, the k-edge connected component (k-ECC) has recently been studied extensively as a way to discover disjoint components, yet many real applications require overlapping components, which poses new challenges. This dissertation proposes a k-vertex connected component (k-VCC) model, which is much more cohesive and allows components to overlap. To find k-VCCs, a top-down framework is first developed that finds the exact k-VCCs in polynomial time. To further reduce the high computational cost on large input networks, a bottom-up framework is then proposed. Instead of using the structure of the entire network, it identifies seed subgraphs locally and obtains heuristic k-VCCs by expanding and merging these seed subgraphs. Comprehensive experimental results show the efficiency and effectiveness of the proposed approaches. The studies in this dissertation address different requirements of diagnostic gene interactive pattern mining over large-scale genome data and provide a new perspective for related studies. |
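
To make the k-VCC notion from contribution (3) concrete, the following is a minimal brute-force sketch in Python. It is not the top-down or bottom-up framework of the dissertation: it simply enumerates vertex subsets, so it is only suitable for tiny graphs. The function names and the five-vertex example network are hypothetical. The sketch assumes the standard definition that a graph is k-vertex connected if it has more than k vertices and stays connected after removing any set of fewer than k vertices, and that a k-VCC is a maximal vertex set inducing such a subgraph.

```python
from itertools import combinations


def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


def is_k_vertex_connected(vertices, adj, k):
    """Check k-vertex connectivity of the induced subgraph by brute force."""
    vertices = set(vertices)
    if len(vertices) <= k:
        return False
    # Removing any set of fewer than k vertices must leave it connected.
    for r in range(k):
        for removed in combinations(vertices, r):
            if not is_connected(vertices - set(removed), adj):
                return False
    return True


def brute_force_k_vccs(adj, k):
    """Enumerate maximal k-vertex connected vertex sets (exponential time)."""
    nodes = sorted(adj)
    found = []
    # Try larger subsets first so that every set kept is maximal.
    for size in range(len(nodes), k, -1):
        for subset in combinations(nodes, size):
            candidate = frozenset(subset)
            if any(candidate <= bigger for bigger in found):
                continue  # already contained in a larger k-VCC
            if is_k_vertex_connected(candidate, adj, k):
                found.append(candidate)
    return found


# Hypothetical toy network: two triangles (A-B-C and C-D-E) sharing vertex C.
bowtie = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"C", "E"},
    "E": {"C", "D"},
}

components = brute_force_k_vccs(bowtie, k=2)
for component in components:
    print(sorted(component))
```

On this toy network the two triangles are reported as separate 2-VCCs that share vertex C, whereas the whole graph forms a single 2-edge connected component, since removing any one edge leaves it connected. This is the kind of overlap that a k-ECC model cannot express.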