Text Clustering Research Based On Semantic Distance

Posted on: 2008-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Lin
Full Text: PDF
GTID: 2178360242478660
Subject: Computer application technology
Abstract/Summary:
With the rapid development of networks, people increasingly feel the impact of information overload. Text is an important carrier of information: about 80% of the information people encounter daily is in the form of text. The content and format of this information are so varied and complicated that people cannot go through everything that interests them, and there is still no standard criterion for classifying text, so managing the information collected from text is an urgent problem. As a result, research on text clustering technology is becoming more important.

Most current clustering methods calculate text similarity by keyword matching based on the vector space model (VSM). The major drawback of this approach is that it overlooks the semantic information between words and the links between the various dimensions, so the resulting text similarity is not accurate. This thesis therefore analyzes text at the semantic level and uses the specific semantics of the text to compute text similarity; the experiments show that the results are more reasonable. The major contributions are as follows:

1. We use the well-known Chinese knowledge base HowNet to calculate the similarity between documents. The calculation is decomposed into several parts, including the semantic distance between keywords and the semantic distance between sememes (the "atoms" that HowNet uses to describe words). Considering the specific application of text clustering, the thesis uses the rules by which HowNet describes words to improve the existing word similarity calculation. This improvement can find the relevance between words and better fits the requirements of text.

2. Our clustering algorithm mainly uses single-pass clustering (nearest neighbor clustering) and proposes a second clustering step to address the main weakness of nearest neighbor clustering, namely that it is sensitive to the input order of the documents. For the cluster center, the concept of similarity weight is introduced: we choose some feature words to represent the cluster according to their weight, so that the feature words that remain are close to the main theme of the cluster, which serves the purpose of text clustering.

Finally, the proposed algorithm is implemented and tested on 100 documents downloaded from the CNLP Platform. Using the precision and recall of clustering as evaluation measures, we compare the clustering results of the proposed algorithm with those of the K-Means algorithm based on VSM. The experiments indicate that the proposed algorithm performs better than the VSM + K-Means algorithm. Moreover, text clustering based on semantic distance can divide the main theme into sub-themes, and these sub-themes can provide better navigation for the collected information.
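The single-pass (nearest neighbor) clustering step mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `jaccard` keyword-overlap function is only a stand-in for a HowNet-based semantic similarity, and the names `single_pass` and `avg_sim`, the threshold value, and the choice to compare a document with the average similarity to a cluster's members are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Doc = Sequence[str]  # a document represented as a list of keywords


def jaccard(a: Doc, b: Doc) -> float:
    """Placeholder similarity (keyword overlap).

    Assumption: a semantic similarity derived from HowNet would be
    plugged in here instead; this function is only a stand-in.
    """
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def avg_sim(doc: Doc, cluster: List[int], docs: Sequence[Doc],
            sim: Callable[[Doc, Doc], float]) -> float:
    """Average similarity between a document and the members of a cluster."""
    return sum(sim(doc, docs[i]) for i in cluster) / len(cluster)


def single_pass(docs: Sequence[Doc],
                sim: Callable[[Doc, Doc], float] = jaccard,
                threshold: float = 0.3) -> List[List[int]]:
    """Assign each document, in input order, to the most similar existing
    cluster, or open a new cluster if no cluster reaches the threshold."""
    clusters: List[List[int]] = []
    for idx, doc in enumerate(docs):
        best, best_score = None, threshold
        for cluster in clusters:
            score = avg_sim(doc, cluster, docs, sim)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([idx])   # no cluster is similar enough
        else:
            best.append(idx)         # join the most similar cluster
    return clusters


docs = [
    ["stock", "market", "shares"],
    ["market", "shares", "price"],
    ["football", "match", "goal"],
    ["goal", "match", "league"],
]
print(single_pass(docs))
```

Because each document is compared only with the clusters that exist when it arrives, the result depends on the order of `docs`; this is the order sensitivity that the second clustering step is meant to reduce.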
Keywords/Search Tags: Text Clustering, Semantic Distance, HowNet, VSM, K-Means, Single Pass Clustering