A DCC-based Study On The Recognition And Observation Of Letter-word Phrase

Posted on: 2006-11-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Z Zheng
Full Text: PDF
GTID: 1115360152988964
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
As another kind of loanword, letter-word phrases have become a new means of phrase formation in Chinese, and they appear more and more often in Chinese texts. In daily communication, however, one concept may appear as several different letter-word phrases. Without proper regulation, this will affect the development of the Chinese character system and of Chinese information processing. Observing and analyzing letter-word phrases in large-scale corpora would provide an important reference for drawing up standardized guidance on their use.

Letter-word phrases, especially those that combine with Chinese characters to form proper nouns and terminology, can be the unknown words in automatic Chinese word segmentation, the descriptors in information retrieval, the translation units in machine translation, and the keywords in automatic classification, automatic abstraction, speech recognition, etc. How well these systems recognize letter-word phrases directly influences their recall and accuracy.

We therefore built a corpus of 100.66 million Chinese characters, a collection of 2002 news from the People's Daily, the Beijing Youth Daily and the Yangcheng Evening News. The corpus is used to investigate the usage of letter-word phrases and to find an effective way to recognize them.

This dissertation presents an in-depth analysis of how letter-word phrases are used and how they can be recognized automatically. Its main contributions are:

1) A formalized definition is given for ELWP (short for engineering definition of letter-word phrases) from the perspective of information processing and letter-word phrase observation. The definition has proved applicable in the automatic extraction and annotation systems.

2) An algorithm is proposed for the automatic extraction of ELWPs. It uses a letter string as the anchor, searches the left and right contexts of that anchor for the boundaries of the letter-word phrase, and applies both rule-based and statistical methods. An automatic extraction system has been designed and implemented based on this algorithm; experiments show an accuracy of 82% for the system. A special coding system for bilingual ELWPs is designed so that the extraction system can recognize them. At present, 712 bilingual ELWPs have been obtained.

3) A corpus of 560,000 Chinese characters with ELWPs annotated has been built, which can be used as the training set and test set for the automatic recognition and extraction of ELWPs.

4) An error-based supervised learning approach is used to revise the rules from the automatic extraction system, and an automatic system for annotating ELWPs has been designed and implemented based on these rules and a collocation coefficient matrix. Experiments show a high accuracy for the systems.

5) This research is based on the People's Daily, the Beijing Youth Daily and the Yangcheng Evening News (Year 2002). So far, we have obtained 62,058 different ELWPs (used 159,235 times). The Beijing Youth Daily used 46,400 different ELWPs (the highest number among the three), and the People's Daily used 5,078 different ELWPs (the lowest number). The three newspapers share a common set of 544 ELWPs, which were used 41,494 times and appeared in 21,657 texts. We have obtained 23 tables of ELWPs. Based on the ELWPs of the People's Daily, this dissertation also studies monoalphabetic ELWPs, digital ELWPs, bilingual ELWPs, pure non-Chinese-character ELWPs, punctuation in ELWPs, parallel structures of ELWPs, the part of speech of ELWPs, etc. This quantitative and qualitative analysis could provide a significant reference for the standardization of letter-word phrases.

6) The database of ELWPs is a new contribution to the Chinese knowledge base.

7) The explication of letter-word phrases has been studied, which leads to the observation of hierarchical meaning extraction for letter-word phrases. The automatic extraction of the explication for letter-word...
Keywords/Search Tags: NLP, DCC, ELWP, letter-word phrase, letter string