Crosslingual implementation of linguistic taggers using parallel corpora

Posted on: 2009-03-31
Degree: M.Sc
Type: Thesis
University: McGill University (Canada)
Candidate: Safadi, Hani
Full Text: PDF
GTID: 2445390005453972
Subject: Computer Science
Abstract/Summary:
This thesis addresses the problem of creating linguistic taggers for resource-poor languages using existing taggers in resource-rich languages. Linguistic taggers are classifiers that map individual words or phrases from a sentence to a set of tags. Part-of-speech tagging and named entity extraction are two examples of linguistic tagging. Linguistic taggers are usually trained using supervised learning algorithms. This requires labeled training data, which is not available for many languages.

A parallel corpus of the source and target languages might not be readily available for many language pairs. To deal with this problem, we describe a system for automatic acquisition of aligned, bilingual corpora from pre-specified domains on the World Wide Web. The system involves automatic indexing of a given domain using a web crawler, identifying pairs of pages that are translations of one another, and aligning bilingual texts at the sentence level. Using this approach, we create a 40,000,000-word English-French parallel corpus from the Government of Canada domain. The quality of this corpus is evaluated and compared to other parallel corpora.

We describe an approach for assigning linguistic tags to sentences in a target (resource-poor) language by exploiting a linguistic tagger that has been configured in a source (resource-rich) language. The approach does not require that the input sentence be translated into the source language. Instead, projection of linguistic tags is accomplished through the use of a parallel corpus, which is a collection of texts that are available in both a source language and a target language. The correspondence between words of the source and target languages allows us to project tags from source-language words to target-language words. The projected tags are further processed to compute the final tags of the target-language words. Using this approach, a system for part-of-speech (POS) tagging of French sentences using an English POS tagger and an English/French parallel corpus has been implemented and evaluated.
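
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes that word alignments are already available as (source index, target index) pairs and that a target word aligned to several source words takes the most frequent projected tag. The function name `project_tags`, the tag labels, and the majority-vote rule are hypothetical choices for illustration; the abstract does not specify how the thesis post-processes projected tags.

```python
from collections import Counter


def project_tags(source_tags, target_len, alignment):
    """Project source-side tags onto target words through word alignments.

    source_tags: list of tags, one per source word.
    target_len:  number of words in the target sentence.
    alignment:   iterable of (source_index, target_index) pairs.
    Returns one tag per target word, or None where no source word aligns.
    """
    # One vote counter per target word.
    votes = [Counter() for _ in range(target_len)]
    for src_i, tgt_i in alignment:
        votes[tgt_i][source_tags[src_i]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Assumed rule: majority vote; ties go to the tag seen first.
            projected.append(counter.most_common(1)[0][0])
        else:
            # Unaligned target word: left untagged for later processing.
            projected.append(None)
    return projected


# Illustrative sentence pair (not taken from the thesis corpus):
# English: "the black cat sleeps"   French: "le chat noir dort"
english_tags = ["DET", "ADJ", "NOUN", "VERB"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
french_tags = project_tags(english_tags, 4, alignment)
print(french_tags)  # ['DET', 'NOUN', 'ADJ', 'VERB']
```

Target words with no aligned source word come back as `None` here; in a full system, a later processing stage would have to assign them a final tag.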
Keywords/Search Tags:Using, Linguistic taggers, Language, Parallel, Source, Approach