The Application Of Vector Space Model In The Similarity Research Of Medical Literature

Posted on: 2007-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Qiu
GTID: 2144360182492063
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Objective: Similarity research is an important topic in knowledge management; it can advance knowledge management and improve the efficiency of literature retrieval. To improve the precision and quality of literature retrieval, this study explores a similarity computing scheme based on the vector space model (VSM) and describes a clustering algorithm for biomedical literature built on that similarity model. The results were applied to medical literature retrieval.

Methods: The study had two parts: algorithm research and applied research. In the algorithm research, papers from three projects were selected and retrieved from MEDLINE, the similarity between papers was computed, and the relative similarity between each paper and the search query was also studied. The procedure comprised building the term database, converting documents to vectors, computing term weights, and evaluating similarity. The term database was built in two ways: first, from a sample of 4600 papers retrieved at random from the MEDLINE database; second, from the papers of the selected projects collected from MEDLINE. Term weights were computed with Term Frequency (TF) and with Term Frequency-Inverse Document Frequency (TF-IDF), respectively. Four schemes for computing the similarity between papers were used in each project. Group 1: the papers of the selected project served as the population, with TF weighting; Group 2: the papers of the selected project served as the population, with TF-IDF weighting; Group 3: the 4600-paper sample served as the population, with TF weighting; Group 4: the 4600-paper sample served as the population, with TF-IDF weighting. The similarities produced by the different schemes were clustered by hierarchical cluster analysis, and clustering effectiveness was evaluated by dendrogram and F-measure.
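The weighting and similarity steps named in the Methods can be illustrated with a minimal sketch. The abstract does not give the formulas used in the thesis, so everything here is an assumption: whitespace tokenization stands in for the term database, IDF is taken as log(N / df), similarity is the cosine of the angle between weight vectors, and the four short strings are invented stand-ins for MEDLINE records.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for the term database.
    return text.lower().split()


def document_frequencies(population):
    """Count, for each term, how many population documents contain it."""
    df = Counter()
    for doc in population:
        df.update(set(tokenize(doc)))
    return df


def weight_vector(text, df=None, n_docs=0):
    """Sparse term -> weight dict: raw TF, or TF-IDF when df is supplied."""
    tf = Counter(tokenize(text))
    if df is None:
        return dict(tf)
    # Assumption: idf = log(N / df); terms unseen in the population are dropped.
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if df.get(t)}


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical population; it plays the role of the reference collection
# from which document frequencies are taken.
population = [
    "hepatitis b virus infection liver",
    "liver cancer risk hepatitis",
    "asthma children inhaled steroid",
    "asthma risk smoking children",
]
df = document_frequencies(population)
vectors = [weight_vector(doc, df, len(population)) for doc in population]

# Rank documents against a query by descending similarity.
query = weight_vector("hepatitis liver", df, len(population))
ranking = sorted(
    range(len(population)),
    key=lambda i: cosine_similarity(query, vectors[i]),
    reverse=True,
)
print(ranking)
```

Switching between the TF and TF-IDF schemes only requires calling `weight_vector` without or with the `df` argument; changing the population passed to `document_frequencies` corresponds to the choice between project papers and the larger random sample.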
For the applied research, the effective algorithm was applied to and analyzed in the evaluation of literature retrieval with the MEDLINE and CBM databases.

Results: According to the dendrogram of project 1, the ranking of clustering effectiveness was group 1 < group 2 < group 3 < group 4, and the ranking by F-measure was group 2 < group 1 < group 3 < group 4. In project 2, the ranking of clustering effectiveness was group 1 < group 2 < group 3 < group 4, and in project 3 it was group 2 < group 4 < group 3 < group 1. When the similarity algorithm and cluster analysis of group 4 were applied to literature retrieval in the MEDLINE and CBM databases, the retrieved papers could be ranked by degree of similarity, by cluster membership, and by how closely each paper matched the query.

Conclusion: The similarity measurement of group 4 (the collected 4600-paper sample plus TF-IDF) is the most effective for computing similarity, and it supports document clustering and ranking effectively. Similarity measurement based on the vector space model, combined with cluster analysis, can improve the efficiency of literature retrieval in the MEDLINE and CBM databases and thereby support biomedical research. To improve the term database, the number of collected documents should be increased so that it covers all areas of biomedical research.
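The abstract ranks the four schemes by F-measure but does not define it. One common definition for clustering, assumed here and not confirmed as the one used in the thesis, takes for each reference class the best F1 score over all clusters and averages these by class size. The labels below are invented for illustration.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted average of the best per-class F1 over clusters."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu  # share of the cluster in this class
            recall = n_both / n_cls     # share of the class in this cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


# Hypothetical reference topics for six papers and two candidate clusterings.
truth = ["A", "A", "A", "B", "B", "B"]
perfect = [0, 0, 0, 1, 1, 1]
mixed = [0, 0, 1, 1, 1, 1]

print(clustering_f_measure(truth, perfect))
print(clustering_f_measure(truth, mixed))
```

A clustering that reproduces the reference topics exactly scores 1, and misplaced papers lower the score, which is what allows schemes such as groups 1 to 4 to be ordered.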
Keywords/Search Tags:Information storage and retrieval, similarity, vector space model, cluster analysis