Research And Application On Co-Clustering Algorithms For High Dimensional And Very Large Data

Posted on: 2011-11-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Ye
Full Text: PDF
GTID: 2189360305468936
Subject: Management Science and Engineering
Abstract/Summary:
Co-clustering is a fairly recent paradigm for unsupervised data analysis, but it has become increasingly popular because it can discover latent local patterns that conventional unsupervised algorithms such as k-means do not reveal. Wide deployment of co-clustering, however, requires addressing a number of practical challenges, including data transformation, cluster initialization, and scalability. This thesis therefore focuses on bringing co-clustering methodologies to maturity, with the ultimate goal of promoting co-clustering as an indispensable unsupervised analysis tool for varied practical applications. To achieve this goal, we pursue three specific tasks: (1) development of co-clustering algorithms that are functional, adaptable, and scalable; (2) extension of co-clustering algorithms to incorporate application-specific requirements; (3) broad application of co-clustering algorithms to existing and emerging problems in practical domains.

As for co-clustering algorithms, we propose an improved Bayesian co-clustering algorithm. It allows mixed membership for rows and columns, meaning that an object can belong to more than one cluster. The algorithm uses exponential-family probability distributions to find the clusters that generated the data. To estimate the numbers of row clusters and column clusters automatically, we also propose an algorithm based on the Bayesian information criterion (BIC).

Concerning co-clustering extensions, we propose a fast, general co-clustering framework based on a stepwise correspondence analysis method. It does not require the whole data matrix to be in main memory, which is crucial for high-dimensional and large datasets. The framework can be implemented with different algorithms, such as k-means, information-theoretic, and Bayesian co-clustering methods. It runs faster than previous methods while achieving comparable accuracy.

Regarding co-clustering applications, we extend the Bayesian co-clustering algorithm to incorporate application-specific requirements. By combining Bayesian co-clustering with the stepwise correspondence analysis method, the extended algorithm can find consistent co-clusters in high-dimensional and large datasets. It first selects rows and columns and then clusters the selected rows and columns simultaneously with the Bayesian co-clustering algorithm. Finally, we report the results of applying the framework to various simulated and real datasets.

In summary, we present co-clustering algorithms that discover latent local patterns, propose algorithmic extensions that incorporate specific requirements, and apply them to a wide range of practical domains.
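To make the idea of clustering rows and columns simultaneously concrete, here is a minimal sketch in Python. It is not the thesis's Bayesian algorithm: it swaps in a simple squared-error co-clustering that alternates between reassigning rows and reassigning columns to the nearest block means, in the spirit of the k-means variant the abstract mentions. The function names (`block_means`, `sse`, `cocluster`), the parameters, and the toy matrix are all assumptions made for illustration.

```python
import random


def block_means(X, rows, cols, k, l):
    """Mean of each (row cluster, column cluster) block; empty blocks get 0.0."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            sums[r][c] += X[i][j]
            counts[r][c] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def sse(X, rows, cols, mu):
    """Sum of squared differences between each entry and its block mean."""
    return sum((X[i][j] - mu[r][c]) ** 2
               for i, r in enumerate(rows) for j, c in enumerate(cols))


def cocluster(X, k, l, n_iter=10, seed=0):
    """Illustrative alternating co-clustering (hypothetical helper).

    Returns row labels, column labels, and the objective value recorded
    at the start of every iteration plus once at the end.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    rows = [rng.randrange(k) for _ in range(n)]   # random initial row labels
    cols = [rng.randrange(l) for _ in range(m)]   # random initial column labels
    history = []
    for _ in range(n_iter):
        mu = block_means(X, rows, cols, k, l)
        history.append(sse(X, rows, cols, mu))
        # Row step: move each row to the row cluster with the lowest error.
        rows = [min(range(k),
                    key=lambda r: sum((X[i][j] - mu[r][cols[j]]) ** 2
                                      for j in range(m)))
                for i in range(n)]
        mu = block_means(X, rows, cols, k, l)
        # Column step: move each column to the column cluster with the lowest error.
        cols = [min(range(l),
                    key=lambda c: sum((X[i][j] - mu[rows[i]][c]) ** 2
                                      for i in range(n)))
                for j in range(m)]
    mu = block_means(X, rows, cols, k, l)
    history.append(sse(X, rows, cols, mu))
    return rows, cols, history


# Toy matrix with two obvious row groups and two obvious column groups.
X = [[5.0, 5.0, 0.0, 0.0],
     [5.0, 5.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 3.0, 3.0]]
rows, cols, history = cocluster(X, k=2, l=2)
print(rows, cols, history[-1])
```

Each step can only lower or keep the squared error, so `history` never increases, but the procedure may stop in a local minimum that depends on the random start. A model-selection rule such as the BIC-based estimate described above would wrap a routine like this in a loop over candidate values of `k` and `l`.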
Keywords/Search Tags: high dimensional and large data, correspondence analysis, co-clustering, Bayesian co-clustering