Research And Application On Co-Clustering Algorithms For High Dimensional And Very Large Data

Posted on:2011-11-06

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Ye

Full Text:PDF

GTID:2189360305468936

Subject:Management Science and Engineering

Abstract/Summary:

Co-clustering is a fairly recent paradigm for unsupervised data analysis, but it has become increasingly popular because it can discover latent local patterns that usual unsupervised algorithms such as k-means do not reveal. Wide deployment of co-clustering, however, requires addressing a number of practical challenges, including data transformation, cluster initialization, and scalability. This thesis therefore focuses on bringing co-clustering methodologies to maturity, with the ultimate goal of promoting co-clustering as an indispensable unsupervised analysis tool for varied practical applications. To achieve this goal, we explore three specific tasks: (1) development of co-clustering algorithms that are functional, adaptable, and scalable; (2) extension of co-clustering algorithms to incorporate application-specific requirements; (3) broad application of co-clustering algorithms to existing and emerging problems in practical domains.

As for co-clustering algorithms, we propose an improved Bayesian co-clustering algorithm. It allows mixed membership for rows and columns, meaning that an object can belong to one cluster and also to another. The algorithm uses the theory of exponential-family probability distributions to find the clusters generated through co-clustering. In addition, to estimate the number of row and column clusters automatically, we propose an algorithm based on the Bayesian information criterion.

Concerning co-clustering extensions, we propose a fast co-clustering framework, based on a gradual correspondence analysis method, for general co-clustering methods. It does not require the whole data matrix to be in main memory, which is crucial for high dimensional and large datasets. It can be implemented with different algorithms, such as k-means, information-theoretic, and Bayesian co-clustering methods. It runs faster than previous methods while achieving comparable accuracy.

Regarding co-clustering applications, we extend the functionality of the Bayesian co-clustering algorithm to incorporate application-specific requirements. By combining the Bayesian co-clustering algorithm with the gradual correspondence analysis method, the extended approach can find consistent co-clusters in high dimensional and large datasets. It first selects rows and columns and then clusters them simultaneously with the Bayesian co-clustering algorithm. Finally, we describe the results of applying the framework to various simulated and real data.

In summary, we present co-clustering algorithms to discover latent local patterns, propose algorithmic extensions to incorporate specific requirements, and apply them to a wide range of practical domains.
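
To make the idea of clustering rows and columns simultaneously concrete, here is a minimal sketch of a generic k-means-style block co-clustering. It is not the Bayesian algorithm of the thesis, whose details are not given in this abstract. The function names, the squared-error cost, the contiguous-chunk initialization, and the toy matrix are all assumptions chosen for illustration.

```python
def block_means(matrix, row_labels, col_labels, k, l):
    """Mean of the entries in each (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            sums[row_labels[i]][col_labels[j]] += value
            counts[row_labels[i]][col_labels[j]] += 1
    # Empty blocks get mean 0.0 (a simplification for this sketch).
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def block_cocluster(matrix, k, l, max_iter=20):
    """Alternate row and column reassignment until the labels stop changing."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Deterministic contiguous-chunk start (hypothetical choice; practical
    # implementations typically use several random restarts instead).
    row_labels = [i * k // n_rows for i in range(n_rows)]
    col_labels = [j * l // n_cols for j in range(n_cols)]
    for _ in range(max_iter):
        # Step 1: move each row to the row cluster with the smallest squared error.
        means = block_means(matrix, row_labels, col_labels, k, l)
        new_rows = []
        for i in range(n_rows):
            costs = [sum((matrix[i][j] - means[r][col_labels[j]]) ** 2
                         for j in range(n_cols)) for r in range(k)]
            new_rows.append(costs.index(min(costs)))
        # Step 2: same for columns, using the updated row labels.
        means = block_means(matrix, new_rows, col_labels, k, l)
        new_cols = []
        for j in range(n_cols):
            costs = [sum((matrix[i][j] - means[new_rows[i]][c]) ** 2
                         for i in range(n_rows)) for c in range(l)]
            new_cols.append(costs.index(min(costs)))
        converged = new_rows == row_labels and new_cols == col_labels
        row_labels, col_labels = new_rows, new_cols
        if converged:
            break
    return row_labels, col_labels, block_means(matrix, row_labels, col_labels, k, l)


# Made-up toy data: two interleaved row types and two interleaved column types.
toy = [
    [9, 0, 9, 0, 9],
    [0, 9, 0, 9, 0],
    [9, 0, 9, 0, 9],
    [0, 9, 0, 9, 0],
    [9, 0, 9, 0, 9],
]
rows, cols, means = block_cocluster(toy, 2, 2)
print(rows)   # [0, 1, 0, 1, 0]
print(cols)   # [0, 1, 0, 1, 0]
print(means)  # [[9.0, 0.0], [0.0, 9.0]]
```

On this toy matrix the row and column labels recover the interleaved pattern, and the block means show the local structure that clustering rows alone would not summarize. Because the procedure is a local search, other data or other starting labels can end in a poorer partition.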

Keywords/Search Tags:

high dimensional and large data, correspondence analysis, co-clustering, Bayesian co-clustering

Related items

1	Research On Dimensionality Reduction And Clustering Algorithm Of Commercial Data Streams
2	Clustering And Naïve Bayesian Algorithm In Customer Value Forecasting
3	Research On A Clustering Analysis Algorithm Facing The Complex Fundamental Data Prepared
4	Research On Pricing Of Large-Scale High-Tech Products In Importation
5	Research On Topic Clustering Model Of Social Tagging Based On Bayesian Theory
6	Research On Visual Analytics Towards Ecological Economics Data
7	System Clustering Analysis Of Multivariable Panel Data Via Probability Link Function And Its Application
8	The Application Of Clustering And Principal Component Regression In The Economic Indicator Data
9	Product Comment Data Tagging Based On Hierarchical AP Clustering
10	Research And Application Of Clustering In Commercial Bank Customer Segmentation