Automatic identification of cognates, false friends, and partial cognates

Posted on: 2007-10-17
Degree: M.C.S
Type: Thesis
University: University of Ottawa (Canada)
Candidate: Frunza, Oana Magdalena
Full Text: PDF
GTID: 2445390005469808
Subject: Computer Science
Abstract/Summary:
Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have different meanings in all contexts.

In addition to the work done on cognate and false-friend identification, we propose a supervised method and a semi-supervised method that uses bootstrapping for disambiguating partial cognates between French and English. The proposed methods use only automatically labeled data, and therefore they can be applied to other pairs of languages as well. The data that we use is automatically collected from parallel corpora. The impact of data collected from different domains is also taken into account in our research.

To complement the studies that we did on cognates, false friends, and partial cognates, we developed an annotation tool for these special types of words. The tool can automatically annotate cognates, false friends, and partial cognates in any French text. The tool uses UIMA (Unstructured Information Management Architecture) from IBM and BaLIE (an open-source Java project designed to extract information from free text).

Partial cognates are pairs of words in two languages that have the same meaning in some, but not all, contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation and Computer-Assisted Language Learning tools. Our research on cognates and false friends between two languages (French and English in our case) consists of automatically classifying a pair of words from the two languages as cognates or false friends. We use Machine Learning techniques with several measures of orthographic similarity as features for classification. We study the impact of selecting different features, averaging them, and combining them through Machine Learning techniques. The methods work on other pairs of languages as long as a small number of annotated word pairs is provided as training data.
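
As a rough illustration of what "measures of orthographic similarity as features" can look like, the sketch below computes three common string-similarity scores for a French-English word pair and averages them. The abstract does not name the measures used in the thesis, so the particular choices here (normalized edit distance, longest common subsequence ratio, bigram Dice coefficient) and all function names are assumptions made for illustration only.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the edit distance divided by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, counted as multisets."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


def similarity_features(french: str, english: str) -> list[float]:
    """Feature vector for one word pair (hypothetical feature set)."""
    return [
        edit_similarity(french, english),
        lcsr(french, english),
        bigram_dice(french, english),
    ]


def average_similarity(french: str, english: str) -> float:
    """Unweighted mean of the features, one simple way to combine them."""
    feats = similarity_features(french, english)
    return sum(feats) / len(feats)


if __name__ == "__main__":
    for pair in [("nature", "nature"), ("famille", "family"), ("chien", "dog")]:
        print(pair, similarity_features(*pair), average_similarity(*pair))
```

A vector like the one returned by `similarity_features` could be passed to any standard classifier trained on annotated word pairs, while `average_similarity` corresponds to the simpler option of averaging the measures into a single score.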
Keywords/Search Tags:Cognates, False friends, Words, Languages, Different, Pairs, Data