Research On CO-Clustering Ensemble For Document

Posted on:2015-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Chun

Full Text:PDF

GTID:2268330428976188

Subject:Signal and Information Processing

Abstract/Summary:

Clustering is an important data mining technique that is widely used in many research fields. With the spread and development of the Internet, the number of documents has grown explosively in recent years, so clustering is widely applied to documents. In document clustering, documents in the same cluster are similar to each other and documents in different clusters are dissimilar. Document clustering is an unsupervised learning process. Co-clustering improves on document clustering: the document attributes and the feature attributes are clustered simultaneously to improve on the performance of traditional document clustering. A single co-clustering run performs unstably, which makes it difficult to capture the distribution structure of a data set. To improve the stability of the algorithm, scholars have put forward the concept of the clustering ensemble in recent years. In the ensemble step, a consensus function combines the clusterings, and stable clustering results are finally obtained.

Co-clustering simultaneously clusters the document attributes and the feature attributes, and fully considers the similarity between documents, between features, and between documents and features. Because document data are unstructured or semi-structured, documents must be converted to a structured representation before further processing. In the traditional vector space model, feature terms are independent, so traditional co-clustering ignores the similarity between words. In this thesis we adopt a double-word vector space model to represent documents, in which high-frequency double words are retained. The double-word vector space model not only retains all the information of the traditional vector space model but also adds more document information.
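The double-word representation described above can be illustrated with a minimal sketch. It assumes that a "double word" is a pair of adjacent words and that "high frequency" means a minimum count across the corpus; the abstract does not define either, so the function name `build_double_word_vsm` and the parameter `min_pair_count` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer; real preprocessing would also
    # remove stop words and apply stemming.
    return text.lower().split()


def build_double_word_vsm(docs, min_pair_count=2):
    """Return (features, matrix) for a single-word plus double-word model.

    Assumption: a double word is a pair of adjacent tokens, and a pair is
    kept only if it occurs at least `min_pair_count` times in the corpus.
    """
    tokenized = [tokenize(d) for d in docs]

    single_counts = Counter()
    pair_counts = Counter()
    for tokens in tokenized:
        single_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))

    # All single words are kept, so the traditional model is a sub-matrix.
    singles = sorted(single_counts)
    frequent_pairs = sorted(
        " ".join(pair) for pair, count in pair_counts.items()
        if count >= min_pair_count
    )
    features = singles + frequent_pairs
    index = {feature: i for i, feature in enumerate(features)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(features)
        for token in tokens:
            row[index[token]] += 1
        for first, second in zip(tokens, tokens[1:]):
            key = first + " " + second
            if key in index:  # only frequent pairs have a column
                row[index[key]] += 1
        matrix.append(row)
    return features, matrix
```

Because every single-word column is kept, the traditional vector space model is contained in the result, and the extra double-word columns carry the added word-to-word information that the abstract refers to.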
Experimental results show that co-clustering based on the double-word vector space model performs better than co-clustering based on the traditional vector space model.

Because the data matrix is high-dimensional and sparse, its dimensionality must be reduced before clustering. Variance volatility describes a feature vector's contribution to document clustering but ignores the similarity between features. The correlation coefficient not only expresses the contribution of a feature vector but also describes the similarity between features; in addition, matrix blocking greatly reduces the running time of the algorithm. Document clustering results correspond to feature clustering results, so document topics can be discovered easily.

Ensembles enhance the stability of clustering, but most traditional ensembles are based on the original data points. As the number of original data points increases, the time complexity of the algorithm grows exponentially. A data fragment ensemble extracts multiple data fragments from the clustering results (there are fewer data fragments than original data points) and uses a consensus function to obtain the final ensemble result. In this thesis, we propose a data fragment ensemble based on a squared residuals algorithm. Experimental results show that the co-clustering ensemble improves the stability and efficiency of co-clustering; the data fragment ensemble has lower time complexity than the traditional ensemble; and the data fragment ensemble based on squared residuals outperforms the hierarchical data fragment ensemble.
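The idea of working on data fragments rather than original points can be sketched as follows. This is an illustration only: it assumes a data fragment is a maximal group of points that receive the same label in every base clustering, which is one common definition and is not confirmed by the abstract. The function names are hypothetical, and the squared-residuals consensus step proposed in the thesis is not reproduced here.

```python
from collections import defaultdict


def data_fragments(base_labelings):
    """Group point indices by their label signature across base clusterings.

    `base_labelings` is a list of label lists, one per base clustering,
    all of the same length. Points with identical signatures are never
    separated by any base clustering, so they form one fragment.
    """
    n_points = len(base_labelings[0])
    fragments = defaultdict(list)
    for i in range(n_points):
        signature = tuple(labels[i] for labels in base_labelings)
        fragments[signature].append(i)
    return fragments


def fragment_coassociation(signatures):
    """Fraction of base clusterings in which two fragments share a label."""
    n_clusterings = len(signatures[0])
    return [
        [sum(a == b for a, b in zip(s, t)) / n_clusterings for t in signatures]
        for s in signatures
    ]


# Two base clusterings of six points.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
]
frags = data_fragments(base)
coassoc = fragment_coassociation(list(frags.keys()))
```

Here six points collapse into four fragments, so any consensus function that operates on the fragment-level matrix `coassoc` handles a 4 x 4 input instead of a 6 x 6 one. That reduction is the source of the time saving the abstract attributes to the data fragment ensemble.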

Keywords/Search Tags:

Document co-clustering, Correlation coefficient, Matrix block, Data fragment, Squared residuals

Related items

1	The Research Of The Key Techniques Of Document Fragment Forensics
2	Research Of Co-clustering Algorithms For Cancer Subtypes Discovery Based On Gene Expression Data
3	Research On Design And Evaluation Of Block Cipher With CPA-Resistance Capability
4	Research On Correlation Rules Mining Algorithm Based On Matrix
5	Research On Data Prediction Based On Filtering Fusion
6	Study On Interconnection Relationship Compute For XML Fragments
7	The Research Of The Clustering Ensembles Based On SEAM Algorithm And It's Application On Text
8	Research On Clustering And Classification Algorithm Of Streaming Data
9	Semi-supervised Non-negative Matrix Factorization And Its Application In Document Clustering
10	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework