Design Of A Subtitle Corpus (MMSC) And Its Applications

Posted on: 2008-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhu
Full Text: PDF
GTID: 2155360212999844
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
Corpora, especially parallel corpora, have become indispensable in many areas of linguistic research, including translation studies and natural language processing. However, because bilingual and multilingual source materials are limited and difficult to process, the development of parallel corpora has lagged far behind that of other types of corpora.

Meanwhile, with the spread of DVDs and the Internet, the volume of film and television subtitles (captions), which are bilingual or multilingual by nature, has grown quickly. Thousands of film and television subtitles are now easily accessible, and the number is still increasing rapidly.

The author therefore attempts to build a parallel corpus from the large volume of subtitles available online or on DVDs: the Mass Media Subtitle Corpus, or MMSC for short. MMSC is open and extensible in design, with a framework that allows easy online access and convenient maintenance. Users and system managers can submit new subtitles through an online portal and have the system align and process them automatically. At the completion of the thesis, MMSC contains more than 3,000,000 words, 200,000 parallel units, and more than 250 film or TV programs, and it is expected to receive many more texts from users and donors in due course.

The thesis centers on the design and creation of MMSC, which involves several steps, including overall design, subtitle selection and collection, text alignment, text annotation, and the design of the concordance platform and maintenance interface. For text alignment, the author proposes a new algorithm designed specifically for subtitles. It uses the time code information in the subtitles to align the bitexts, which sets it apart from traditional alignment algorithms that take statistical approaches.

The thesis then reports several pilot studies on MMSC to explore its potential uses in areas such as translation studies, translator training, and English teaching. Finally, the author discusses the further development of MMSC and suggests one possible upgraded version.
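
The abstract does not give the details of the time-code alignment algorithm, so the following is only a minimal sketch of the general idea, not the author's method. It assumes SRT-style time codes (`HH:MM:SS,mmm`), cues sorted by start time without overlapping one another within a language, and a simple temporal-overlap rule. The names `Cue`, `parse_timecode`, `align_by_timecode`, and the `min_overlap_ratio` threshold are all hypothetical, and the sample cues are invented.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One subtitle unit: display interval in milliseconds plus its text."""
    start_ms: int
    end_ms: int
    text: str


def parse_timecode(tc: str) -> int:
    """Convert an SRT-style time code 'HH:MM:SS,mmm' to milliseconds."""
    hms, millis = tc.strip().split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


def overlap_ms(a: Cue, b: Cue) -> int:
    """Length of the time interval shared by two cues (0 if disjoint)."""
    return max(0, min(a.end_ms, b.end_ms) - max(a.start_ms, b.start_ms))


def align_by_timecode(source, target, min_overlap_ratio=0.5):
    """Pair each source cue with the target cues shown at the same time.

    Assumption (not from the thesis): a target cue matches when the shared
    interval covers at least `min_overlap_ratio` of the shorter cue.
    Both lists must be sorted by start time.
    """
    pairs = []
    j = 0
    for src in source:
        # Skip target cues that ended before this source cue started.
        while j < len(target) and target[j].end_ms <= src.start_ms:
            j += 1
        matched = []
        k = j
        # Scan target cues that start before this source cue ends.
        while k < len(target) and target[k].start_ms < src.end_ms:
            tgt = target[k]
            shorter = min(src.end_ms - src.start_ms, tgt.end_ms - tgt.start_ms)
            if shorter > 0 and overlap_ms(src, tgt) / shorter >= min_overlap_ratio:
                matched.append(tgt.text)
            k += 1
        if matched:
            pairs.append((src.text, " ".join(matched)))
    return pairs


# Invented sample cues for illustration only.
english = [
    Cue(parse_timecode("00:00:01,000"), parse_timecode("00:00:03,000"), "Hello."),
    Cue(parse_timecode("00:00:04,000"), parse_timecode("00:00:06,000"), "How are you?"),
]
chinese = [
    Cue(parse_timecode("00:00:01,100"), parse_timecode("00:00:02,900"), "你好。"),
    Cue(parse_timecode("00:00:04,200"), parse_timecode("00:00:06,100"), "你好吗？"),
]
aligned = align_by_timecode(english, chinese)
```

Unlike length-based statistical aligners, a rule of this kind never compares the texts themselves. It relies only on the fact that the two subtitle tracks are timed against the same film, so it would need an offset correction step if the tracks came from differently timed releases.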
Keywords/Search Tags: corpus, film, television, subtitle, translation