Research On Automatic Construction Of Lexical Semantic Resources For Low-Resource Languages

Posted on: 2018-07-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Zhang
Full Text: PDF
GTID: 1485305411484054
Subject: Computer application technology
Abstract/Summary:
It is estimated that there are more than seven thousand languages in the world, but most of them are not widely spoken, and some are already extinct. Because they have few available resources, or even no resources or writing system at all, these languages are called low-resource languages, and most Natural Language Processing (NLP) technologies do not support them. Ancient languages are one kind of low-resource language. Since ancient languages are the source and basis of their modern descendants, studying them is an indispensable part of linguistics and is important for historical and cultural studies.

Research on low-resource language processing is still at an early stage. Existing NLP methods are either statistics-based or rule-based, but neither can be applied to low-resource languages directly. Statistical methods depend on large-scale data to perform well, yet resources for low-resource languages are hard to obtain. Rule-based methods usually need different rules for different languages, and it is impractical to build rules for thousands of languages. This dissertation focuses on the difficulty of obtaining resources for low-resource languages and studies the automatic construction of their lexical semantic resources, with a case study on Pre-Qin Ancient Chinese (PQAC). The research has two aspects: all-words sense annotation of a raw corpus, and automatic construction of a structured lexical semantic resource, i.e., a wordnet.

First, we propose a method for building an all-words sense-annotated corpus in a low-resource language from the glosses and examples in a traditional Machine-Readable Dictionary (MRD), via a semi-supervised Word Sense Disambiguation (WSD) procedure. The result is useful to many research fields in low-resource languages, such as Information Retrieval, Machine Translation, and Semantic Understanding. The minimal annotated training data is extracted from the MRD's glosses and examples, so it is reliable and has high coverage of the target words' senses. Word vectors learned from a raw corpus are also used to represent the local context of target words, which reduces the sparsity of the training data. The tagging result on PQAC, which is a low-resource language, reaches an average precision of over 75.94% in a sampling evaluation.

Second, we propose a hierarchical procedure for thesaurus integration. The procedure works in a top-down manner, which resolves the heterogeneity between the two thesauruses and simplifies the integration process. We apply this procedure to integrate Tongyici Cilin (Cilin), a Chinese-specific thesaurus, into the Chinese Concept Dictionary (CCD), another Chinese thesaurus whose concepts correspond one-to-one to English concepts. The result is a new resource, called CCD-Extend, that combines the advantages of CCD and Cilin. A lexical semantic similarity method using CCD-Extend performs better than the same method using CCD or Cilin alone. With CCD-Extend as the intermediary between English and PQAC, it becomes possible to obtain a structured lexical semantic resource for PQAC that covers not only the concepts shared with English but also the Chinese-specific concepts.

Third, we propose a method for constructing a wordnet for a low-resource language, enriching its structured semantic resources by mapping glosses in an MRD onto Synonym Sets (Synsets) in Princeton WordNet. The result links the low-resource language to the world's other languages and can be treated as part of the Global WordNet. Because it is hard to map a whole sentence onto a Synset semantically, we propose a strategy of replacing glosses with their kernel words. Because the kernel words may be polysemous and may share no equivalent sense, we apply a graph-based WSD procedure to find their core sense. Finally, we apply our method to the mapping between PQAC and English, and the result achieves a precision of over 85%. The method is also extended to Middle Ancient Chinese and Near Ancient Chinese, with mapping precision of over 80% and 90% respectively. The method has low resource requirements, a high degree of automation, and good scalability.
Keywords/Search Tags: Low-resource language processing, Resource construction, Pre-Qin Ancient Chinese, All-words sense annotation, Thesaurus integration, Global WordNet