
Bayesian Text Segmentation for Terminology Extraction

Posted on: 2013-08-22
Degree: M.S.
Type: Thesis
University: University of California, Irvine
Candidate: Koilada, Nagendra
Full Text: PDF
GTID: 2455390008468483
Subject: Artificial Intelligence
Abstract/Summary:
Automatically extracting terminology and index terms from scientific literature is useful for a variety of digital library, indexing, and search applications. The task is non-trivial: it is complicated by domain-specific terminology and by the steady introduction of new terms. Correctly identifying nested terminology is both interesting and challenging. Commonly used approaches rely on knowledge of document structure and on supervised learning techniques to retrieve terminology. We present a new approach called Dirichlet Process Segmentation (DP-Segmentation) to identify key terms. This method is a Bayesian technique based on a probabilistic generative model for the production of multi-word segments. DP-Segmentation outperforms previous methods for extracting nested multi-word terminology. In addition, the method has the advantage of being very robust: it is language-independent and does not require parsing or part-of-speech tagging. As such, DP-Segmentation has potential applications beyond the extraction of index terms, such as segmenting Chinese text.
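The abstract does not spell out the generative model, so the following is only a rough sketch of how a Dirichlet process prior over multi-word segments could be used to score a segmentation. It is not the thesis's implementation. Everything here is an assumption made for illustration: the function names, the base measure (a geometric length prior with uniform word choice), the concentration parameter `alpha`, and the simplification of holding segment counts fixed and finding the best split by dynamic programming rather than sampling.

```python
import math
from collections import Counter

def base_prob(segment, vocab_size=1000, p_stop=0.5):
    # Hypothetical base measure: geometric prior on segment length,
    # uniform choice of each word from a vocabulary of vocab_size.
    n = len(segment)
    return p_stop * (1 - p_stop) ** (n - 1) * (1.0 / vocab_size) ** n

def predictive_prob(segment, counts, total, alpha=1.0):
    # Dirichlet process predictive probability (Chinese restaurant process form):
    # previously seen segments are reused in proportion to their counts,
    # and new segments are drawn from the base measure with weight alpha.
    return (counts[segment] + alpha * base_prob(segment)) / (total + alpha)

def best_segmentation(words, counts, alpha=1.0, max_len=4):
    # Simplification: counts are held fixed, so each segment is scored
    # independently and the best split is found by dynamic programming.
    total = sum(counts.values())
    n = len(words)
    best = [(-math.inf, 0)] * (n + 1)  # (log-probability, start of last segment)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            seg = tuple(words[start:end])
            score = best[start][0] + math.log(
                predictive_prob(seg, counts, total, alpha))
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers to recover the segments in order.
    segments, end = [], n
    while end > 0:
        start = best[end][1]
        segments.append(tuple(words[start:end]))
        end = start
    return segments[::-1]

# Toy counts, invented purely for this example.
counts = Counter({("dirichlet", "process"): 5, ("text",): 3, ("segmentation",): 2})
print(best_segmentation(["dirichlet", "process", "text", "segmentation"], counts))
```

With these toy counts, the frequently seen pair ("dirichlet", "process") is kept together as one segment, because reusing a known segment is far more probable than generating its words afresh from the base measure. A full Bayesian treatment would instead resample segment boundaries and update the counts as it goes.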
Keywords/Search Tags: Terminology, Terms