Automatically extracting terminology and index terms from scientific literature is useful for a variety of digital library, indexing, and search applications. The task is non-trivial: it is complicated by domain-specific terminology and by the steady introduction of new terms. Correctly identifying nested terminology is both interesting and challenging. Commonly used approaches rely on knowledge of document structure and on supervised learning techniques to retrieve terminology. We present a new approach, Dirichlet Process Segmentation (DP-Segmentation), for identifying key terms. The method is a Bayesian technique based on a probabilistic generative model for the production of multi-word segments. DP-Segmentation outperforms previous methods for extracting nested multi-word terminology. In addition, the method is robust: it is language-independent and requires neither parsing nor part-of-speech tagging. As such, DP-Segmentation has potential applications beyond the extraction of index terms, such as segmenting Chinese text.