A Multi-Perspective Study of Modern Chinese Verb-Object Collocations and Their Automatic Identification

Posted on: 2009-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cheng
GTID: 2205360245976505
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
A collocation is a recurrent combination of words that stand in some grammatical relation to one another. Collocations play an important role in fields such as automatic syntactic analysis and machine translation (MT). Verb-object collocations attract particular attention because of their high frequency, complexity, and flexible usage, and they are the core issue in building a collocation lexicon. Relying solely on manual selection to build such a lexicon is undesirable, so natural language processing (NLP) needs to study collocations further in order to find methods suited to large-scale computer processing.

Based on the concept of generalized collocation and on the accurately annotated Tsinghua Chinese Treebank, this thesis addresses verb-object collocations in two ways: it studies them from multiple angles, and it automatically acquires and recognizes them.

In the first part, about 50611 verb-object collocation instances (tokens) extracted from the treebank are studied with both qualitative and quantitative methods. The qualitative analysis covers the grammar, lexis, and semantics (logic) involved in the collocations, in terms of word order and part-of-speech, the grammatical attributes of the verb, and the lexical semantic roles of the collocation. This provides a theoretical reference for the subsequent identification. The quantitative analysis introduces the statistics commonly used in automatic collocation acquisition and analyzes the verb-object collocations of the treebank in terms of collocation frequency, mutual information, mean distance, and variance, in order to judge which measures to use in the subsequent recognition stage.

The second part automatically acquires and recognizes verb-object collocations, using first a traditional statistical method and then a statistical machine learning method. In the traditional statistical method the processing model is relatively simple and relies on a single statistical index, such as co-occurrence frequency or mutual information. The resulting F value is about 50%, which is not ideal. Therefore a more complex statistical model based on machine learning, conditional random fields (CRFs), is used to acquire the verb-object collocations automatically.

The experiments report how the results vary with different word segmentation and part-of-speech tag sets, different constraints on the part-of-speech sequence types of the collocation, and different corpora (source and size). They also test the effect, at the feature-setting stage, of syllable features, verb subcategorization features, context features, and combinations of these. Overall, the best result is F = 87.40%, obtained with the treebank's segmentation and part-of-speech tagging, while the result obtained with the Peking University segmentation and part-of-speech tagging is F = 74.70%.

The automatic identification results show that CRFs are indeed effective for sequence labeling. There is still room for improvement in follow-up work on identification.
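The statistics named in the abstract (collocation frequency, mutual information, mean distance, variance) and the F value can be sketched as follows. This is only an illustrative sketch, not the thesis's implementation: the function names `collocation_stats` and `f_measure`, the `(verb, object, distance)` input format, the use of pointwise mutual information estimated from the extracted pairs alone, and the population variance are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pvariance


def collocation_stats(pairs):
    """Compute per-pair statistics from (verb, object, distance) tokens.

    Assumption: probabilities are estimated over the extracted pair
    tokens only, not over the whole corpus.
    """
    pairs = list(pairs)
    n = len(pairs)
    pair_freq = Counter((v, o) for v, o, _ in pairs)
    verb_freq = Counter(v for v, _, _ in pairs)
    obj_freq = Counter(o for _, o, _ in pairs)

    # Collect the observed verb-to-object distances for each pair type.
    distances = defaultdict(list)
    for v, o, d in pairs:
        distances[(v, o)].append(d)

    stats = {}
    for (v, o), f in pair_freq.items():
        # Pointwise mutual information: log2( P(v, o) / (P(v) * P(o)) ).
        pmi = math.log2((f / n) / ((verb_freq[v] / n) * (obj_freq[o] / n)))
        ds = distances[(v, o)]
        stats[(v, o)] = {
            "freq": f,
            "pmi": pmi,
            "mean_dist": mean(ds),
            "var_dist": pvariance(ds),  # population variance of the distances
        }
    return stats


def f_measure(predicted, gold):
    """Balanced F value (harmonic mean of precision and recall) over two sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A threshold on one of these statistics would correspond to the single-index baseline the abstract describes, while `f_measure` shows how an F value such as those reported could be computed from predicted and gold collocation sets.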
Keywords/Search Tags: Verb-Object Collocation, Automatic Identification, Syntactic Parsing, CRFs, Feature Templates