Text Clustering Research Based On Semantic Distance

Posted on:2008-10-19

Degree:Master

Type:Thesis

Country:China

Candidate:L Lin

Full Text:PDF

GTID:2178360242478660

Subject:Computer application technology

Abstract/Summary:

With the rapid development of networks, people increasingly feel the impact of information overload. Text is an important carrier of information: about 80% of the information people encounter daily is in text form. The content and format of this information are so varied and complex that people cannot go through everything that interests them, and there is still no standard criterion for classifying text, so managing the information collected from text urgently needs a solution. As a result, research on text clustering technology has become more important.

Most current clustering methods compute text similarity by keyword matching based on the vector space model (VSM). The major drawback of this approach is that it overlooks the semantic information between words and the links between the dimensions, so the resulting text similarity is inaccurate. This thesis therefore analyzes text at the semantic level and uses the specific semantics of the text to compute text similarity; experiments show that the results are more reasonable. The major contributions are as follows:

1. We use the well-known Chinese knowledge base 《Hownet》 to calculate the similarity between documents. The calculation is decomposed into several parts, including the semantic distance between keywords and between atoms. Considering the specific application of text clustering, the thesis uses the rules by which 《Hownet》 describes words to improve the existing word similarity calculation. This improvement can find the relevance between words and better fits the requirements of text.

2. Our clustering algorithm is mainly based on single-pass clustering (nearest neighbor clustering), and we propose a second clustering pass to address the weakness of nearest neighbor clustering, namely its sensitivity to the input order of the documents.
For the cluster center, we introduce the concept of similarity weight: feature words are chosen to represent each cluster according to their weight, so that the feature words that remain are similar to the main theme of the cluster, which achieves the purpose of text clustering.

Finally, the proposed algorithm is implemented and tested on 100 documents downloaded from the CNLP Platform. Using the precision and recall of clustering as evaluation measures, we compare the clustering results of the proposed algorithm with those of the VSM-based K-Means algorithm. The experiments indicate that the proposed algorithm performs better than the VSM+K-Means algorithm. Moreover, text clustering based on semantic distance can divide a main theme into sub-themes, and these sub-themes provide better navigation for the collected information.
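
The single-pass clustering with a second pass described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify how the second clustering works, so the reassignment rule below is an assumption, as are the function names (`single_pass`, `second_pass`, `nearest_cluster`), the average-similarity cluster score, and the threshold. A simple Jaccard similarity over word sets stands in for the 《Hownet》-based semantic similarity.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Doc = FrozenSet[str]
Sim = Callable[[Doc, Doc], float]


def jaccard(a: Doc, b: Doc) -> float:
    """Stand-in similarity; the thesis uses a Hownet-based semantic measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_cluster(doc: Doc, clusters: List[List[int]],
                    docs: Sequence[Doc], sim: Sim) -> Tuple[int, float]:
    """Return (index, score) of the cluster with the highest average similarity."""
    best_idx, best_score = -1, -1.0
    for idx, members in enumerate(clusters):
        if not members:
            continue  # skip clusters emptied during the second pass
        score = sum(sim(doc, docs[m]) for m in members) / len(members)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


def single_pass(docs: Sequence[Doc], sim: Sim, threshold: float) -> List[List[int]]:
    """Assign each document, in input order, to its nearest cluster or a new one."""
    clusters: List[List[int]] = []
    for i, doc in enumerate(docs):
        idx, score = nearest_cluster(doc, clusters, docs, sim)
        if idx >= 0 and score >= threshold:
            clusters[idx].append(i)
        else:
            clusters.append([i])
    return clusters


def second_pass(docs: Sequence[Doc], clusters: List[List[int]],
                sim: Sim, threshold: float) -> List[List[int]]:
    """Assumed second pass: re-check every document against all first-pass clusters."""
    clusters = [list(c) for c in clusters]
    for i, doc in enumerate(docs):
        current = next(k for k, c in enumerate(clusters) if i in c)
        clusters[current].remove(i)
        idx, score = nearest_cluster(doc, clusters, docs, sim)
        if idx >= 0 and score >= threshold:
            clusters[idx].append(i)      # move to the best-matching cluster
        else:
            clusters[current].append(i)  # nothing fits better; stay put
    return [sorted(c) for c in clusters if c]
```

Because the second pass compares each document against clusters that already contain every other document, a document placed early, when few clusters existed, gets a chance to move to a better match. That is one plausible way to reduce the input-order sensitivity the abstract mentions.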

Keywords/Search Tags:

Text Clustering, Semantic Distance, 《Hownet》, VSM, K-Means, Single Pass Clustering

Related items

1	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
2	Research Of Web Text Clustering Based On Semantic
3	Chinese Text Clustering Based On Latent Semantic And Its Applications
4	The Study And Development Of Hierarchical-K-means-Based Clustering Algorithm
5	Text Clustering And Its Application Based On CFSFDP Algorithm
6	Research And Implementation Of Text Clustering Based On Fuzzy C-Means Clustering Algorithm
7	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
8	Study On The Chinese Text Clustering Algorithm Based On Semantic Similarity
9	Text Clustering Based On K-means Algorithm And Realization
10	Internet News Hot Mining System Research And Implementation