The Construction of an Integration Platform for Mongolian Corpus Processing

Posted on: 2016-08-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Wu
GTID: 1225330461980880
Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
A corpus is a collection of authentic natural-language texts gathered according to defined principles. Once processed, a corpus becomes a useful resource for a variety of Natural Language Processing (NLP) systems. Processing here means mining the information hidden in the corpus. By granularity, corpus processing can be divided into lexical analysis, phrase analysis, sentence analysis, and semantic annotation.

A ten-million-word Mongolian corpus has already been constructed, and Mongolian corpus processing covers many aspects, such as morphological processing, syntactic analysis, and semantic tagging. However, a representative multi-level processed corpus has not yet been built. Most researchers extract samples from large-scale corpora and annotate them at different levels according to their own standards and experimental purposes. This leads to a great deal of repeated work, and the resulting corpora are not interchangeable; most of them cannot be applied directly to other studies. A large-scale annotated corpus that can be widely used in Mongolian information processing is therefore much needed.

This study builds on the theory and methods of corpus linguistics to construct a multistage processed Mongolian corpus covering morphological processing, named entity tagging, fixed phrase tagging, and semantic type annotation. Combining machine annotation with manual proofreading, it takes a representative corpus, the ten-million-word Modern Mongolian corpus, and annotates it for morphological structure, named entities (person names, location names, and organization names), fixed phrases, and semantic types.

In previous work, we developed the Mongolian lexical analysis system Mglex, which reached an accuracy of about 90% with a training corpus of about 200,000 words. That system, however, did not perform named entity recognition, which is an important part of a Mongolian lexical analysis system.
A complete Mongolian lexical analysis system includes not only lexical tagging but also named entity recognition. Named entity recognition is, moreover, an important foundation for information extraction, information retrieval, chunk parsing, machine translation, and question answering, so its results directly affect deeper research on automatic text processing. This study therefore investigates Mongolian named entity recognition software. The specific research contents are:

(1) Mongolian person names and location names are recognized with a conditional random field (CRF) model combined with a rule-based method. Based on the characteristics of Mongolian names and locations, we selected 6 and 5 kinds of features, respectively, as CRF model features; for category names, a rule-based method was adopted. Finally, dictionaries and rules are used to correct errors and to recover names and locations that were missed. The experimental results show that recognition precision for names and locations reached 94.56% and 94.68%, recall reached 90.60% and 84.40%, and the F score reached 92.54% and 89.24%, respectively.

(2) For organization names, we propose a method based on rules and a knowledge base. Based on the grammatical characteristics of Mongolian organization names, the study summarizes how simple and composite organization names are formed, designs effective recognition rules and a corresponding knowledge base, and implements recognition of Mongolian organization names. As test data, we selected 243 text segments (containing 417 organization names) from articles in the Political News section of China Mongolian News. The experimental results show that the system achieved a precision of 73.75% and a recall of 67.38% on this test set.

In addition, this study describes improvements to the Mongolian morphological annotation software.
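The F scores reported for names and locations can be cross-checked from the other two figures. The sketch below assumes that the reported "accuracy rate" denotes precision and that the F score is the balanced harmonic mean of precision and recall (F1); neither assumption is stated explicitly in the abstract, and the function name `f_score` is chosen here only for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F score (F1): harmonic mean of precision and recall.

    Both inputs are percentages; the result is a percentage too.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract (percentages).
name_f = f_score(94.56, 90.60)      # person names
location_f = f_score(94.68, 84.40)  # location names

print(f"Names:     F = {name_f:.2f}%")
print(f"Locations: F = {location_f:.2f}%")
```

Under these assumptions the computed values round to the 92.54% and 89.24% given in the text, which supports reading the "accuracy rate" as precision.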
We improved the Mglex software in four respects: corpus preprocessing, candidate word optimization, disambiguation, and post-processing. We propose a collocation-based disambiguation method and a co-occurrence-frequency method for acquiring phrase collocations. With these improvements, Mglex achieves a word-level segmentation accuracy of 97.80% and a word-level joint segmentation and tagging accuracy of 94.00%.
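The abstract does not spell out how collocations are acquired from co-occurrence frequency, so the following is only a minimal sketch of the general idea, not the Mglex implementation. It counts how often two words occur next to each other and keeps the pairs that reach a frequency threshold. The function names, the `window` and `min_count` parameters, and the placeholder tokens `w1` to `w4` are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count ordered word pairs whose positions differ by less than `window`.

    With window=2 only directly adjacent words are paired.
    """
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                counts[(left, right)] += 1
    return counts


def collocation_candidates(sentences, min_count=2, window=2):
    """Keep the pairs that co-occur at least `min_count` times."""
    counts = cooccurrence_counts(sentences, window)
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Placeholder tokens stand in for segmented Mongolian words.
sentences = [
    ["w1", "w2", "w3"],
    ["w1", "w2", "w4"],
    ["w3", "w1", "w2"],
]
candidates = collocation_candidates(sentences)
print(candidates)
```

A raw frequency threshold is the simplest possible criterion; association measures such as pointwise mutual information are a common refinement, though whether the dissertation uses one is not stated here.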
Keywords/Search Tags: Mongolian corpus, multistage annotation, Mongolian named entity recognition, lexical analysis system, collocation