Maintaining retrieval effectiveness in distributed, dynamic information retrieval systems

Posted on: 1997-01-23
Degree: Ph.D.
Type: Dissertation
University: University of Virginia
Candidate: Viles, Charles Lowell
GTID: 1468390014983304
Subject: Computer Science
Abstract/Summary:
We present a rigorous empirical study of how the use of subset-derived collection statistics influences retrieval effectiveness. We give a generic model for searching a document collection that allows for the use of collection statistics derived from a subset of the collection. Within this model, we identify two realistic scenarios requiring the use of subset-derived collection statistics. The first involves distributed document databases and the second involves ad hoc search in dynamic document databases.

We view the distributed document archive as a set of collections, each of which knows about some fraction of the other collections in the archive. Document collections are built empirically by taking standard IR test collections and parametrically assigning their documents to a collection in the system. Our results show that content-skew has a pronounced negative effect on retrieval effectiveness. Content-skew is the degree to which the holdings at a particular site differ from those at another site or at a globally defined "central" site. Highly skewed document collections require more knowledge about the global collection than those that are content-uniform. However, even in highly skewed systems, sites can know about a relatively small fraction of the holdings at other sites without pronounced degradation in search quality.

We model the dynamic document archive as two collections: an "old" collection with complete statistics available, and a "new" collection composed of recently inserted documents that have not yet been incorporated into the document index and collection statistics. Our results show that retrieval effectiveness is maintained for "new" collections of realistic size when statistics from the "old" collection are used. The only problematic situation arises when terms are introduced into the "new" collection that are not contained in the "old" collection.

We also give two methods for measuring content-skew directly, one based on topic identification and the other based on the well-known inverse document frequency statistic. We use one or both of these methods to measure the content-skew of three kinds of document archives: our empirically defined collections, the TREC collection, and the Networked Computer Science Technical Report Library (NCSTRL), an operational distributed archive.
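To make the idea of subset-derived collection statistics concrete, the following is a minimal sketch and not the dissertation's actual model. It assumes a plain tf-idf scorer as a stand-in for the retrieval function, and the names `idf_table` and `score`, the `default_idf` parameter, and the toy documents are all hypothetical. It scores "new" documents with inverse document frequency values computed from the "old" collection only, and shows the problematic case of a term ("skew") that the "old" collection has never seen.

```python
import math
from collections import Counter


def idf_table(docs):
    """Map each term to log(N / df), using only the documents given."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def score(query, doc, idf, default_idf=0.0):
    """Sum tf * idf over query terms.

    Terms missing from the idf table fall back to default_idf
    (an assumption of this sketch, not a choice taken from the source).
    """
    tf = Counter(doc)
    return sum(tf[term] * idf.get(term, default_idf) for term in query)


# Toy collections: each document is a list of already-tokenized terms.
old = [
    ["retrieval", "index"],
    ["retrieval", "query"],
    ["index", "archive"],
    ["query", "archive"],
]
new = [
    ["retrieval", "skew"],
    ["archive", "index"],
]

idf_old = idf_table(old)        # statistics derived from the "old" subset only
idf_all = idf_table(old + new)  # statistics derived from the full collection

query = ["retrieval", "skew"]

# Score the first "new" document under both sets of statistics.
score_subset = score(query, new[0], idf_old)
score_full = score(query, new[0], idf_all)

# Rank the "new" documents using subset-derived statistics.
ranked_old = sorted(
    range(len(new)),
    key=lambda i: score(query, new[i], idf_old),
    reverse=True,
)
```

With subset-derived statistics, the unseen term "skew" contributes nothing to the score, whereas under full-collection statistics it is the rarest and therefore most heavily weighted query term. In this toy case the ranking of the two "new" documents is unaffected, but the gap between `score_subset` and `score_full` illustrates why previously unseen terms are the difficult case.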
Keywords/Search Tags: Collection, Retrieval effectiveness, Distributed, Document, Dynamic, Archive