Research On Technologies Of Chinese And English Verb Subcategorization

Posted on: 2010-09-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C H Zhu
GTID: 1115360332456371
Subject: Computer application technology
Abstract/Summary:
Verb subcategorization information, which mainly encodes the distributional types of predicate features, is indispensable knowledge for further progress in natural language processing. Subcategorization acquisition has made considerable theoretical and practical progress for many languages. Applying the mature theory and resources of verb subcategorization frames (SCFs) to practical applications, with as little linguistic knowledge as possible, can deepen the current theoretical understanding of SCFs and offer a new perspective on research into lexemes and hierarchical syntactic structure. The work therefore has both theoretical significance and broad application prospects.

Two key issues hinder the further application of verb SCFs. First, text extracted directly from real applications contains a great deal of format noise and carries no syntactic information, so current automatic SCF acquisition techniques cannot take such data directly as input. Second, the automatic acquisition process relies on hand-written linguistic rules as heuristic information. To investigate verb SCFs thoroughly, this thesis is organized as follows:

1. Format-noise filtering and text normalization methods are analyzed. Corrections for several noise types, such as paragraph division, sentence division, punctuation usage, and truecasing of English words, are integrated into a unified framework. The unified model takes text containing different kinds of noise directly as input and, during filtering, considers the more complex dependencies between noise types at the same time, unlike traditional methods that handle each noise type independently. The method greatly improves text normalization performance and makes the data acceptable to downstream natural language processing tools.

2. A joint system for Chinese word segmentation and part-of-speech (POS) tagging is studied, and the advantages and disadvantages of several classifier fusion methods are discussed from a function-space perspective. The joint system performs segmentation and POS tagging at the same time, which avoids the error accumulation between segmenter and POS tagger found in traditional Chinese lexical analysis.

3. Linguistic knowledge is extracted automatically from large-scale data. This knowledge, which takes the form of SCF argument correspondences, can replace heuristic information. The process relaxes the requirement of traditional methods for completely correct syntactic information. Furthermore, with active learning strategies, almost no prior linguistic knowledge is needed during the whole process. Compared with heuristic information, the linguistic knowledge obtained by this method has far wider coverage.

4. Verb SCFs are analyzed automatically by a supervised method based on a weighted gap subsequence kernel function. The method takes correspondences with the same argument type as training samples and treats each argument type as a classifier category. The kernel maps the input space into a feature space, where the category of correspondence to apply is determined by the similarity between the current input and the argument correspondences of each type. Owing to this new way of using rules and to the kernel function, the consistency of the derived arguments increases substantially.

5. The automatic extraction of English-Chinese SCF argument equivalence pairs is investigated. Starting from a simple mapping as the initial seed, a large number of new correspondences are found on a large-scale bilingual parallel corpus. These bilingual phrase pairs are then added to a statistical machine translation (SMT) system. The resulting performance improvement indicates the validity of the bilingual argument correspondences obtained automatically by this method.

With the above technologies, the thesis completes the full process of acquiring SCFs from practical data and applying them back to practical tasks. The output of each step is the input of the next.
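
To make the kernel-based step (item 4) more concrete, the following is a minimal sketch of a gap-weighted subsequence kernel over token sequences, followed by a nearest-example classifier that picks the argument type whose training sequence is most similar to the input. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the exact kernel definition, so the standard gap-weighted formulation (each matched subsequence weighted by a decay factor raised to its span) is assumed here. The function names, the POS-tag sequences, and the labels `NP`, `NP_PP`, and `TO_INF` are all hypothetical. Subsequences are enumerated by brute force for readability; practical implementations usually use dynamic programming instead.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def gap_weighted_features(seq, n, lam):
    """Explicit feature map (assumed formulation, n >= 1): every length-n
    subsequence gets weight sum(lam ** span) over its occurrences, where
    span counts positions from the first to the last matched item."""
    feats = defaultdict(float)
    for idx in combinations(range(len(seq)), n):
        span = idx[-1] - idx[0] + 1
        feats[tuple(seq[i] for i in idx)] += lam ** span
    return feats


def subsequence_kernel(s, t, n=2, lam=0.5):
    """Inner product of the two gap-weighted feature vectors."""
    fs = gap_weighted_features(s, n, lam)
    ft = gap_weighted_features(t, n, lam)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


def normalized_kernel(s, t, n=2, lam=0.5):
    """Cosine-style normalization so that long sequences do not dominate."""
    denom = sqrt(subsequence_kernel(s, s, n, lam) * subsequence_kernel(t, t, n, lam))
    return subsequence_kernel(s, t, n, lam) / denom if denom else 0.0


def classify_by_similarity(query, examples, n=2, lam=0.5):
    """Return the label of the training example most similar to the query."""
    best = max(examples, key=lambda ex: normalized_kernel(query, ex[0], n, lam))
    return best[1]


# Hypothetical training data: POS-tag sequences paired with argument types.
EXAMPLES = [
    (("V", "DT", "NN"), "NP"),
    (("V", "DT", "NN", "IN", "DT", "NN"), "NP_PP"),
    (("V", "TO", "VB"), "TO_INF"),
]

if __name__ == "__main__":
    print(subsequence_kernel("cat", "car", n=2, lam=0.5))
    print(classify_by_similarity(("V", "DT", "JJ", "NN"), EXAMPLES))
```

In this sketch the decay factor `lam` penalizes subsequences that are spread over a wider span, so an input with an inserted modifier still matches a training sequence, only with a lower weight. For the strings "cat" and "car" with `n=2`, the only shared subsequence is "ca", which gives a kernel value of `lam ** 4`.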
Keywords/Search Tags: verb subcategories, text normalization, active learning, argument analysis, statistical machine translation