Effects of similarity metrics on document clustering

Posted on:2010-04-08

Degree:M.S.C.S

Type:Thesis

University:University of Nevada, Las Vegas

Candidate:Veni, Rushikesh

Full Text:PDF

GTID:2448390002973919

Subject:Computer Science

Abstract/Summary:

Document clustering, or unsupervised document classification, is the automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. Many similarity functions, such as the dot product and the cosine measure, have been proposed in the literature for this comparison.

In this thesis, we evaluate the effects that the choice of similarity function has on clustering. We first represent each document and each query as a vector in a high-dimensional space whose dimensions correspond to keywords. We then use an appropriate distance measure in k-means to compute the similarity between the document vector and the query vector and to form clusters. Based on the resulting clusters, we decide which distance metric is best for the document set used. Finally, we compare the time complexities of the different similarity functions for the same model and document set, based on the number of iterations and the number of clusters.
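The comparison described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes toy keyword-count vectors, fixed initial centroids, and a mean-based centroid update for both metrics (which is only a heuristic for cosine distance). All names here (`kmeans`, `cosine_distance`, `euclidean_distance`, `docs`) are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to document length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between vectors."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot(a, b) / norm


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, distance, max_iter=100):
    """Basic k-means with a pluggable distance function.

    Returns (labels, iterations), where iterations counts assignment
    passes until the labels stop changing.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            return labels, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels, max_iter


# Toy counts over two hypothetical keywords: ("cluster", "law").
docs = [
    [1, 0],   # short document about clustering
    [10, 1],  # long document about clustering
    [0, 1],   # short document about law
    [1, 10],  # long document about law
]
initial = [docs[0], docs[3]]

for name, metric in [("euclidean", euclidean_distance), ("cosine", cosine_distance)]:
    labels, iterations = kmeans(docs, initial, metric)
    print(name, labels, iterations)
```

On this toy data the two metrics produce different groupings. Euclidean distance places the short law document with the clustering documents because their vectors are close in magnitude, whereas cosine distance groups documents by topic direction regardless of length. The returned iteration count gives a rough handle for comparing how quickly each metric converges.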

Keywords/Search Tags:

Document, Similarity

Related items

1	Design And Implement Of Duplicate Document Detection Based On Similarity Estimation
2	Research On Cross-language Document Sorting Learning Method Based On Bilingual Document Similarity
3	Computing Document Similarity For The Legal Case Retrieval
4	Application Of Document Similarity Detection In Enterprise Document Leakage Prevention
5	Research And Implementation Of Document Similarity Based On Word2vec
6	Effects of similarity metrics on document clustering
7	Research And Application On Document Similarity Detection Based On Minwise Hashing
8	Similarity Computing Of Scientific And Technical Documents Based On Texts And Formulas
9	Web Page Structure Similarity Algorithms And Applications
10	Research Of P2P Document Query Based On Semantic Similarity