Search in Adverse Environments

Posted on: 2017-04-25
Degree: Ph.D.
Type: Dissertation
University: Georgetown University
Candidate: Soo, Jason J
Full Text: PDF
GTID: 1448390005474018
Subject: Information Technology
Abstract/Summary:
Today, search is a ubiquitous task. This task often carries the expectation that relevant results will be returned within the first 10 documents. While the advent of modern online search engines has created such expectations, there exist environments in which such approaches fall short. These environments are defined by their lack of vital resources, such as the Internet, query logs, user models, and refined algorithms. This amalgam of resources is the keystone of modern search systems. Without these resources, systemic error rates become intractable, and a novel, customized approach is required.

Frequently, adverse environments host information of great value. Examples include medical records, personal information, historical documents, and national security data. These collections often contain errors introduced by users, by systematic processes (for example, an Optical Character Recognition process), or both. Accounting for such errors while still retrieving relevant documents is the focus of my research.

I assert that a solution that effectively considers both a term's context and its substring features can yield superior results with minimal external dependencies when searching under such adverse conditions.

In this dissertation, I present my solution for searching corrupted document collections in adverse environments. My solution, Segments, is a language-independent, domain-independent, unsupervised approach that I experimentally show is as good as or better than the prior art, the state of the art, and commonly deployed solutions. Segments achieves its results by analyzing context and substring features of corrupted terms. Segments is in use within the Archives Section of the United States Holocaust Memorial Museum to search multilingual collections with sparse query logs.

This document is dedicated to describing my experimental results and demonstrating both the strengths and the drawbacks that Segments has to offer for real-world deployments.

Index words: Optical Character Recognition (OCR), Spelling Correction, Post Processing, Corrupted Documents.
Keywords/Search Tags:Search, Adverse, Environments, Results, Documents