| The field of text categorization, which aids applications such as browsing, filtering, and search, has experienced a revival due to the vast amounts of unlabeled data available on line and as part of digital collections. Almost all of the literature in the field, however, deals with the categorization of text-only documents. Many of the same techniques can be applied to text associated with multimedia documents to label the multimedia component. My dissertation provides an in-depth exploration of the automatic categorization of images using associated text. This research takes advantage of a corpus I have created containing news documents with embedded captioned images and multiple sets of categories. It turns out that the text and categories associated with images tend to have different properties than those associated with full-length text documents such as e-mails, articles, and web pages. Also, images provide us with an additional type of information; namely, low-level image features. For these reasons, I have achieved success in several areas of research that have previously been problematic, such as combining systems and using NLP techniques to improve performance. Some benefits of this work are demonstrated as part of Columbia's Newsblaster system, which finds, clusters, categorizes, and summarizes news on the web. Newsblaster has already captured the attention of the public and press; articles about Newsblaster have appeared in sources including The New York Times, USA Today, and Slashdot, and a recent analysis indicates that Newsblaster receives tens of thousands of hits every day.; The research discussed in this dissertation fits into two general paradigms. One paradigm involves research in machine learning techniques. Within this framework, I have developed two text categorization systems that use novel approaches. One of these systems applies a statistical technique known as density estimation to the output generated by another system. Density estimation provides probabilistic confidence measures for predictions and often improves accuracy. The other system, BINS, uses a binning technique to empirically estimate term weights for groups of words that share statistical features in common (instead of estimating term weights for individual words). This enables the system to compute accurate term weights for words with scarce evidence. Other work that fits into this paradigm involves the combination of a set of rules, each of which is very accurate but rarely applicable, with systems to which we can fall back when the rules do not apply.; The second paradigm of research, less common in the literature, involves novel representation of documents. Almost all modern text categorization systems use bag of words approaches, meaning that documents are represented by vectors of weighted words. Neither syntax nor semantics are considered, and no other information is used. For specific categories applying to images, however, I have found that substantial improvement can be obtained using more advanced NLP techniques. I confirm this hypothesis by presenting evidence from experiments with human subjects who have viewed image captions under varying conditions. I then describe an NLP based system that significantly outperforms all standard systems that have been tested for a specific task. Additional work that fits into this paradigm includes the use of low-level image features (e.g. color), alone or in combination with text, to categorize images. |