
Research on Co-Clustering Ensemble for Documents

Posted on: 2015-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Chun
Full Text: PDF
GTID: 2268330428976188
Subject: Signal and Information Processing
Abstract/Summary:
Clustering is an important data mining technique that is widely used in many research fields. With the spread and development of the Internet, the number of documents has grown explosively in recent years, and clustering is now widely applied to documents. Documents in the same cluster are similar to each other, while documents in different clusters are dissimilar. Document clustering is an unsupervised learning process. Co-clustering builds on document clustering: the document attributes and feature attributes are clustered simultaneously to improve on the performance of traditional document clustering. A single co-clustering run is unstable, which makes it difficult to capture the distribution structure of a data set. To improve the stability of the algorithm, researchers have proposed the concept of the clustering ensemble in recent years. In the ensemble step, a consensus function combines the clustering results, and a stable final clustering is obtained.

Co-clustering simultaneously clusters the document attributes and the feature attributes, and fully considers the similarity between document and document, feature and feature, and document and feature. Because document data are unstructured or semi-structured, the documents need to be represented before document preprocessing. In the traditional vector space model, feature terms are independent, so traditional co-clustering ignores the similarity between words. In this thesis we adopt a double-word vector space model to represent documents, and the double words that occur with high frequency are retained. The double-word vector space model not only retains all the information of the traditional vector space model but also adds more document information.
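The double-word representation described above can be illustrated with a small sketch. The abstract does not define "double words" precisely, so this example assumes they are pairs of adjacent words, and the names `build_vocabulary`, `vectorize`, and `min_pair_count` are hypothetical choices made for illustration, not the thesis's own implementation.

```python
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(docs, min_pair_count=2):
    """Collect all single words plus adjacent word pairs that occur
    at least min_pair_count times across the whole collection."""
    words = set()
    pair_counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        words.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent_pairs = [
        " ".join(pair)
        for pair, count in pair_counts.items()
        if count >= min_pair_count
    ]
    # Single-word features first, then the retained double-word features.
    return sorted(words) + sorted(frequent_pairs)


def vectorize(doc, vocabulary):
    """Count how often each vocabulary term (word or word pair) occurs in doc."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [counts[term] for term in vocabulary]


docs = [
    "data mining finds patterns",
    "data mining clusters documents",
    "documents form clusters",
]
vocab = build_vocabulary(docs, min_pair_count=2)
matrix = [vectorize(doc, vocab) for doc in docs]
```

Under these assumptions every single-word feature of the traditional model is kept, and only the pair "data mining" is frequent enough to be added as an extra column of the document-feature matrix.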
Experimental results show that co-clustering based on the double-word vector space model performs better than co-clustering based on the traditional vector space model.

Because the data matrix is high-dimensional and sparse, its dimensionality needs to be reduced before clustering. Variance volatility describes a feature vector's contribution to document clustering but ignores the similarity between features. The correlation coefficient expresses not only the contribution of a feature vector but also the similarity between features; at the same time, the use of matrix blocks greatly reduces the running time of the algorithm. Document clustering results correspond to the feature clustering results, so the document topics can be discovered easily.

Ensembles enhance the stability of clustering, but most traditional ensembles operate on the original data points. As the number of original data points increases, the time complexity of the algorithm grows exponentially. A data fragment ensemble extracts multiple data fragments from the clustering results (the number of data fragments is smaller than the number of original data points) and uses a consensus function to obtain the final ensemble result. In this thesis, we propose a data fragment ensemble based on the squared residuals algorithm. Experimental results show that the co-clustering ensemble improves the stability and efficiency of co-clustering; the data fragment ensemble has lower time complexity than the traditional ensemble; and the data fragment ensemble based on squared residuals outperforms the hierarchical data fragment ensemble.
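The abstract does not spell out how squared residuals are computed. As a rough illustration, the sketch below uses the mean squared residue commonly used in the co-clustering literature, where the residue of an entry is the entry minus its row mean and column mean plus the overall mean of the co-cluster. Whether the thesis uses exactly this form is an assumption, and the function name `mean_squared_residue` is hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the co-cluster formed by the given
    row and column indices (assumed definition, see the text above)."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [
        sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


data = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 0.0, 1.0],
]
# The first two rows differ only by a constant shift, so they form a coherent block.
coherent = mean_squared_residue(data, rows=[0, 1], cols=[0, 1, 2])
# Adding the third row breaks that pattern and raises the residue.
mixed = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
```

A lower value indicates a more coherent co-cluster, which is why a measure of this kind can serve as a criterion when grouping rows and columns or, as assumed here, when merging data fragments.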
Keywords/Search Tags: Document co-clustering, Correlation coefficient, Matrix block, Data fragment, Squared residuals