
Study Of XML Documents Clustering In Web Mining Domain

Posted on: 2010-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhao
Full Text: PDF
GTID: 2178360275962623
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology, the volume of data on the Web is growing explosively. How to obtain the desired knowledge from this mass of Web data efficiently and accurately has become a hot research topic. Web mining emerged in this environment: it aims to extract potentially valuable knowledge or patterns from the Web, and its main techniques, such as classification, clustering and feature selection, have developed rapidly. Cluster analysis plays an important role in Web mining. Clustering divides a set of objects into several categories according to a similarity measure, so that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. As a preprocessing stage, clustering can improve the efficiency and accuracy of data mining by partitioning the data set into groups.

The majority of Web pages are text documents in HTML, but given the diversity and complexity of Web data, HTML documents cannot meet the requirements of information exchange and information processing. XML is an alternative standard put forward by the W3C. Because of features such as flexibility, openness and self-description, XML has gradually become the main Web data format and data exchange standard. Research on XML clustering is therefore of great significance. This thesis presents a systematic analysis of XML clustering and puts forward a semantic feature extraction method. Several improved clustering algorithms are proposed, and experiments are conducted on real and artificial datasets. The work and innovations of this thesis are as follows.

First, clustering algorithms and XML-related definitions are summarized and analyzed, and the shortcomings of the clustering algorithms commonly used in document clustering are pointed out.
Second, we focus on the key problem of XML document clustering, namely the measurement of document similarity, and study the classical edit distance and the similarity measure based on edge sets. After analyzing the vector space model, we propose an XML document vector model based on the combination of tags and paths. Feature weights are defined according to the level in the document tree, which lets the model express the semantics of nested XML elements. We compute the similarity between example documents with our method and with the two methods mentioned above; the results show that our method distinguishes documents better.

Machine learning techniques provide important technical support for Web mining, and ensemble learning and semi-supervised learning have emerged in recent years. Substantial research and experimentation have shown that ensemble learning and semi-supervised learning can improve the performance of clustering and classification. Based on a study of both, we improve the traditional single clustering algorithm. To address the weaknesses of single clustering algorithms, we propose a clustering approach based on the Bagging algorithm. In the base cluster generation stage, we use the bootstrap method to sample the original document set, producing a number of subsets, and run a partitional clustering algorithm on each. We then apply a clustering consensus rate to remove low-quality cluster centers. Finally, we run hierarchical clustering on the collection of cluster centers generated by the partitional clustering method to obtain the result. Because this ensemble approach has a higher computational complexity, we also propose a clustering method based on semi-supervised learning to improve the ensemble clustering algorithm.
We run the FCM clustering algorithm and pause at a specified iteration to sample the original document set, then combine the data near the cluster centers into a new data set. We apply a hierarchical clustering method to this set to obtain the right number of cluster centers, which FCM then uses to continue to the final result. Finally, we apply these algorithms to real and artificial datasets separately. The results show that the proposed clustering algorithms outperform single clustering algorithms and are more robust.
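The tag-and-path vector model described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, so the `1 / depth` weight used here is an assumption chosen only so that elements nearer the root count more; the function names `tag_path_vector` and `cosine_similarity`, the `tag:` and `path:` feature prefixes, and the sample documents are likewise hypothetical and are not taken from the thesis.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def tag_path_vector(xml_text):
    """Map an XML document to a sparse {feature: weight} vector.

    Each element contributes one tag feature and one root-to-element
    path feature. The 1/depth weight is an assumed scheme, not the
    formula from the thesis.
    """
    root = ET.fromstring(xml_text)
    vector = defaultdict(float)

    def visit(element, parent_path, depth):
        path = parent_path + "/" + element.tag
        weight = 1.0 / depth  # assumption: shallower elements weigh more
        vector["tag:" + element.tag] += weight
        vector["path:" + path] += weight
        for child in element:
            visit(child, path, depth + 1)

    visit(root, "", 1)
    return dict(vector)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(feature, 0.0) for feature, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents: same tags with different nesting,
# and the same nesting under a different root tag.
doc_a = "<book><title/><author><name/></author></book>"
doc_b = "<book><author><title/><name/></author></book>"
doc_c = "<article><title/><author><name/></author></article>"

vec_a = tag_path_vector(doc_a)
vec_b = tag_path_vector(doc_b)
vec_c = tag_path_vector(doc_c)

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print("similarity(a, b):", round(sim_ab, 3))
print("similarity(a, c):", round(sim_ac, 3))
```

Because path features encode nesting, moving `title` under `author` in `doc_b` lowers its similarity to `doc_a` even though both documents use the same set of tags. A vector built from tags alone would not register that change in the feature set.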
Keywords/Search Tags: XML, Vector Space Model, Ensemble Learning, Semi-supervised Learning