Biomedical science is one of the most exciting research fields of the 21st century, and the rapid increase in the number of related publications in recent years has led to a serious problem of information overload. Researchers face growing challenges in mining the relevant literature effectively and tracking the latest research progress. Biomedical text clustering addresses this problem: by grouping similar documents together and separating dissimilar ones, it helps users organize, summarize, navigate, and locate documents of interest. As an effective biomedical text mining tool, biomedical text clustering is therefore of great importance in both theory and application.

This thesis focuses on the development of effective algorithms for biomedical text clustering. First, a new method for calculating semantic similarity is developed and applied to a biomedical ontology, MeSH (Medical Subject Headings). Second, we develop a new semi-supervised clustering method, SSNCut (Semi-Supervised Normalized Cut), for clustering biomedical literature. Finally, experimental results show that this method effectively uses three types of information from biomedical literature (local content information, MeSH-based semantic information, and global content information) and significantly improves clustering performance.

This thesis is organized as follows:

a) Background and existing work in biomedical text clustering. We summarize some popular clustering methods and standard evaluation metrics. We also briefly review existing studies on biomedical text clustering and point out their limitations.

b) The computation of MeSH semantic similarity. We develop a novel method that calculates MeSH semantic similarity more accurately. Furthermore, we propose two frameworks for calculating semantic similarity over MeSH. Experimental results show that the new measure is superior to the original measures.

c) The development of a new semi-supervised clustering method to improve biomedical document clustering. SSNCut makes effective use of three types of information from biomedical literature: local content information, semantic information based on MeSH, and global content information. The results show that combining these three types of information significantly improves clustering performance.