
A term co-occurrence based framework for understanding LSI: Theory and practice

Posted on: 2004-05-20
Degree: Ph.D
Type: Dissertation
University: Lehigh University
Candidate: Kontostathis, April
Full Text: PDF
GTID: 1467390011474881
Subject: Computer Science
Abstract/Summary:
Automatic methods for searching textual collections have been developed since the early 1960s, but a global solution to the problem remains elusive. Latent Semantic Indexing (LSI) is a well-known information retrieval algorithm based on a linear algebraic technique, the Singular Value Decomposition (SVD).

The primary goal of this dissertation is the development of a theoretical framework for understanding LSI. In particular, we study the values produced by the SVD process and determine their impact on LSI performance. We take two approaches to this analysis of values, and we develop two practical applications based on our improved knowledge of the relationship between the values in the truncated matrices and the performance of LSI.

The first part of this dissertation develops the theoretical framework. The framework is based on the concept of term co-occurrences, and we prove that LSI encapsulates term co-occurrence information. We also show a strong correlation between the retrieval quality of LSI and the distribution of the term co-occurrence weights.

In the second part, we extend our study of the values produced by the SVD by implementing practical applications. First, we identify the most critical values of the LSI matrices by reducing the density of the matrices by up to 70% without impacting retrieval quality. This reduction yields a 55% decrease in memory requirements at query run time. We also develop a term clustering algorithm based on the LSI term matrix. This algorithm is shown to produce effective clusters for use in an emerging trend detection application. Our emerging trend detection system achieved an F-measure (beta = 1) of 0.81 to 0.89 on several collections.
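
As a rough illustration of the ideas summarized above, the following minimal sketch truncates the SVD of a small term-by-document matrix, zeroes a fraction of the entries in a truncated factor, and scores a query against the rank-k approximation. It is not the dissertation's implementation: the abstract does not say how values are chosen for removal, so dropping the smallest-magnitude entries is an assumption, and the names `lsi_truncate`, `sparsify`, and `query_scores` as well as the toy matrix are hypothetical.

```python
import numpy as np


def lsi_truncate(A, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of a term-by-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def sparsify(M, fraction):
    """Return a copy of M with the given fraction of entries set to zero.

    Assumption: the entries removed are those with the smallest absolute
    value. The abstract does not specify the selection rule.
    """
    M = M.copy()
    n_zero = int(fraction * M.size)
    if n_zero > 0:
        # Flat indices of the n_zero smallest-magnitude entries.
        idx = np.argsort(np.abs(M), axis=None)[:n_zero]
        M.flat[idx] = 0.0
    return M


def query_scores(q, Uk, sk, Vtk):
    """Inner product of query vector q with each document in U_k diag(s_k) Vt_k."""
    return (q @ Uk) * sk @ Vtk


# Toy term-by-document matrix (5 terms, 4 documents); values are made up.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

Uk, sk, Vtk = lsi_truncate(A, k=2)
Uk_sparse = sparsify(Uk, 0.7)   # drop 70% of the term-matrix entries
Vtk_sparse = sparsify(Vtk, 0.7)  # drop 70% of the document-matrix entries

q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # query containing terms 0 and 2
dense_scores = query_scores(q, Uk, sk, Vtk)
sparse_scores = query_scores(q, Uk_sparse, sk, Vtk_sparse)
```

Comparing the document rankings induced by `dense_scores` and `sparse_scores` on a real collection is one way to check whether a given level of sparsification affects retrieval quality; the toy matrix here is too small to say anything about that.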
Keywords/Search Tags:Framework for understanding LSI, Term co-occurrence