
Corpus-Based Automatic Recognition of Mongolian Names

Posted on: 2014-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L G Tong
Full Text: PDF
GTID: 1225330401458610
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The automatic recognition of Mongolian names is one of the subtasks of named entity recognition.

Over half a century of development in Chinese and English information processing, great progress has been achieved in the construction of basic resources, POS tagging, information retrieval, text categorization, machine translation, speech recognition and synthesis, and man-machine dialogue. The modernization of Chinese and English information processing has greatly stimulated the theoretical and technological development of minority language information processing in China.

Compared with Chinese and English information processing, Mongolian information processing started relatively late, but it has produced distinctive results within minority language information processing. It has completed the processing of characters and words and entered the stage of sentence processing. Having finished the shallow lexical analysis tasks of identifying phrase-structure relations and defining phrase boundaries, Mongolian information processing is moving toward deep lexical analysis. At the same time, research on Mongolian information retrieval, automatic summarization, text categorization and machine translation continues to grow. Mongolian lexical analysis and tagging is the basic research in Mongolian information processing and is of great value to the study of phrases, syntax, semantics and texts. However, as a foundation, lexical analysis and tagging has not made comparable progress in the study of unknown words, and of named entity recognition in particular.
The underdevelopment of named entity recognition lowers the accuracy of lexical analysis and thereby holds back the development of phrase analysis, syntactic analysis, information retrieval and machine translation.

Since proper nouns are an important part of the corpus, a breakthrough in proper noun recognition is the foundation for improving the accuracy of Mongolian lexical analysis and of other follow-up studies. Ambiguity and unknown words are the two greatest obstacles to segmentation accuracy. Here, unknown words refer to neologisms and named entities, including the names of people and places. As the result of work on the automatic recognition of Mongolian names, the present paper covers both name recognition among unknown words and multi-category name recognition, so the study has considerable academic and practical value.

Mongolian texts contain a great many names, most of which are multi-category words, and there are few studies of Mongolian names, so little ready-made theoretical or technological reference is available. The study of Mongolian name recognition therefore faces many challenges, among them the following:
☆ Names form an open set, so an exhaustive method cannot be adopted. In Mongolian, the more common a word is, the likelier it is to be used as a name; nouns, verbs, adjectives, numerals, temporal words, adverbs, pronouns and mimetic words, that is, words of any part of speech, can be used as names. Because the multi-category phenomenon is severe in Mongolian names, name recognition is very difficult.
☆ The intensively processed corpus is much smaller than those available for Chinese and English, which inevitably limits the application of statistical methods. Although Inner Mongolia University has an intensively processed corpus of 2 million words, the author only had access to a 260-thousand-word corpus.
The much smaller corpus limits rule extraction and machine learning.
☆ Recognition of proper nouns has long been a difficulty in Mongolian lexical analysis and tagging. Because personal names often double as place names and other proper names, the multi-category nature of proper nouns is a further difficulty.

The present paper employs a statistical method to identify Mongolian names. Building on traditional rules, it successfully applies the maximum entropy model to Mongolian named entity recognition and realizes the automatic recognition of Mongolian names. The innovations and contributions of the present paper are as follows:
◇ Setting up a Mongolian name recognition corpus for the first time
Although the growing scale of Mongolian corpora has pushed Mongolian information processing forward, there is still no Mongolian name recognition corpus at home or abroad. The author selected 5,773 sentences containing Mongolian names and used them, along with the Inner Mongolia University corpus, to train the recognition model and test the results of automatic recognition, which compensated for the immaturity of the corpus.
◇ Systematically researching the internal and external structures of Mongolian names
The author examines the ethnic, regional, period and gender characteristics of Mongolian names, summarizes their internal composition patterns, explains the structural types of Mongolian names and how they change, describes the specific Mongolian surnames and their origins, and lists the Chinese surnames used by Mongolian people.
◇ Formulating the tagging of Mongolian corpus and transliteration specification
Based on an analysis of the current tagging of Mongolian corpora, the author puts forward the Contemporary Mongolian Tagging Specification for Corpus.
To solve various problems in the tagging of Chinese names, a detailed Latin Transliteration Schemes for the Chinese Names is formulated, based on the regular practice for tagging loan words in Mongolian and with reference to the Specifications for Contemporary Mongolian Corpus Annotation.
◇ Setting up the knowledge base for name recognition
To identify Mongolian names automatically, the author sets up two groups of knowledge bases in the form of dictionaries or mapping tables. The knowledge bases of common names include the Chinese Surnames Dictionary, Mongolian Surnames Dictionary, Dictionary of Mongolian Common Names, Latin Mapping Table for Chinese Surnames, Latin Mapping Table for Chinese Names, Dictionary of Sanskrit, Tibetan & Manchu Names, Famous Names Dictionary, Word Bank of Name Deixis, Suffix Dictionary for Place Names, and Suffix Dictionary for Organization Names. The knowledge bases of multi-category names include the Multi-category Names Dictionary, Collocation Dictionary for Multi-category Words, and Stem Dictionary for Mongolian Names.
◇ Designing and realizing the automatic recognition system for names with maximum entropy
The experiment shows that the method adopted in the paper, a pioneering application of statistical methods to Mongolian named entity recognition at home and abroad, reaches a precision of 94.56%, a recall of 85.15% and an F-value of 89.61, which indicates that recognition is highly effective.
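
The three evaluation figures reported above can be cross-checked against each other. The sketch below assumes that the first figure (94.56%) is precision and that the F-value is the balanced F1 score, the harmonic mean of precision and recall; the abstract states neither explicitly. The function name `f_measure` and its `beta` parameter are illustrative and do not come from the dissertation.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract, in percent (assumed to be precision and recall).
precision = 94.56
recall = 85.15

f1 = f_measure(precision, recall)
print(f"{f1:.2f}")  # prints 89.61
```

Under these assumptions the computed value matches the reported F-value of 89.61, which supports reading the first figure as precision.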
Keywords/Search Tags: Mongolian information processing, Contemporary Mongolian Tagging Specification for Corpus, Latin Transliteration Schemes for the Chinese Names, Mongolian name recognition, ME