
Identification Of Non-clauses Among Language Fragments Between Punctuation In Chinese Complex Sentences

Posted on: 2009-02-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Li
GTID: 1115360245957572
Subject: Chinese Philology
Abstract/Summary:
To meet the needs of Chinese information processing, and now that character and word processing have yielded their first results, sentence processing has been placed on the agenda. Natural language understanding is, in the final analysis, the understanding of individual sentences. Chinese sentences are either simple or complex, and machine understanding of complex sentences is both the key and the difficult point. One reason is that machine understanding of complex sentences necessarily builds on the understanding of simple sentences; another is that it involves dividing the sentence into hierarchical levels and determining the logical-semantic relations among clauses. In addition, as computer hardware and software have improved, statistics-based and example-based corpus methods have become increasingly popular in computational linguistics as a supplement to rule-based rationalism. In this context we set out to build a finely processed corpus of modern Chinese complex sentences that provides knowledge and statistical data for machine understanding of complex sentences. This dissertation is part of that corpus construction. Its main objective is to exclude the language fragments between punctuation marks that do not take part in the hierarchical and relational analysis of a complex sentence, that is, to identify and tag non-clause fragments. Its main contents are as follows.

Chapter one first reviews earlier studies of the tangled boundary between simple and complex sentences, analyzes the complex phenomena involved, and views the facts from the perspective of prototype theory in cognitive linguistics. It defines the nature and scope of the non-clause (and the clause) using the Theory of Clause as Nucleus, then has the computer make a preliminary identification using punctuation as a marker, which rules out some non-clauses. Finally, it classifies in detail the non-clauses that the computer cannot identify at this stage: some arise from the arbitrariness of Chinese punctuation, some from complex constituents within a clause, and some from special constituents that stand alone as language fragments.

Chapter two first introduces the tagging of parts of speech (POS) and of non-clauses. Second, following the Theory of Clause as Nucleus and verb-centered theory, it uses the tagged POS information to automatically identify some non-clauses that contain no verb. Finally, it places language fragments that are relatively fixed phrases into a phrase lexicon and identifies them automatically through a series of rules.

Chapter three identifies and tags some non-clauses automatically using syntactic information. It first briefly describes how computers process natural language, then discusses two types of formal markers and their role in identification and tagging: one is the overt formal marker, such as a preposition as beginning marker and a time or place word as ending marker; the other is the structural particle "的" and the copular verb "是". On this basis, further rules for the automatic identification and tagging of non-clauses are formulated and added to the rule base.

Chapter four presents the experiment and its results, based on the rules compiled in the preceding two chapters. An ACCESS database is set up first, and the beginning and ending markers are stored in it. To judge whether a language fragment is a non-clause, simple string matching is used: the beginning or ending of each fragment is matched against the beginning or ending markers in the database, and if the match succeeds, the fragment is a non-clause. The accuracy of each identification and tagging rule is then checked manually, one rule at a time, and the causes of errors and strategies for future improvement are analyzed. In particular, because the rules in chapters two and three were all developed on the training set, their contribution rate on the training set is counted, and the rules are then applied to the entire complex-sentence corpus to test their accuracy and to improve and refine them continually.

Chapter five attempts to identify some non-clauses by combining syntactic, semantic, and collocation knowledge; this method is still at a pilot stage. The chapter first discusses the importance of semantic knowledge in machine understanding of natural language, then surveys computer-oriented semantic research in China and abroad and presents the semantic theory adopted in this dissertation. It then sets out the working premises of the study, including the selection and restriction of the research corpus and the approach to the problem. Focusing on semantic roles, semantic types, and semantic features, it sets up a verb-object matching framework for 127 senses of 108 verb entries in the dictionary and presents 18 criteria for judging whether the relation between two nouns following a verb is "pianzheng" (偏正, modifier-head). The framework is then used to analyze examples from sub-corpus 2, and a proposal is made to build a table of verb-object matching frequencies. The chapter ends with a summary.

Chapter six lays down, from a theoretical standpoint and following Professor Xing Fuyi's theory, a series of identification rules for cases in which a noun is the nucleus of a sentence, although few such examples are found in the training corpus. These rules have not yet been implemented in a computer program or manually tested on an actual corpus.
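
The string-matching step described for chapter four can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the marker lists, the delimiter set, the function names, and the "undecided" label are all assumptions made for illustration, and plain tuples stand in for the ACCESS database, whose contents the abstract does not give.

```python
import re

# Hypothetical marker lists for illustration only. The dissertation keeps its
# markers in an ACCESS database whose contents are not given in the abstract.
BEGIN_MARKERS = ("关于", "对于", "按照")   # e.g. prepositions as beginning markers
END_MARKERS = ("以后", "之前", "的时候")   # e.g. time words as ending markers

# Punctuation that delimits language fragments (an assumed, non-exhaustive set).
FRAGMENT_DELIMITERS = re.compile(r"[，。；：？！,;:?!]")


def split_fragments(sentence):
    """Split a sentence into the non-empty fragments between punctuation marks."""
    return [part.strip() for part in FRAGMENT_DELIMITERS.split(sentence) if part.strip()]


def is_non_clause(fragment):
    """Return True if the fragment starts or ends with one of the stored markers."""
    return fragment.startswith(BEGIN_MARKERS) or fragment.endswith(END_MARKERS)


def tag_fragments(sentence):
    """Label each fragment 'non-clause' on a marker match, otherwise 'undecided'."""
    return [
        (fragment, "non-clause" if is_non_clause(fragment) else "undecided")
        for fragment in split_fragments(sentence)
    ]


if __name__ == "__main__":
    for fragment, label in tag_fragments("关于这个问题，我们已经讨论过了。"):
        print(fragment, label)
```

A fragment that matches no marker is labeled "undecided" rather than "clause" here, since the abstract describes further rules (POS-based, phrase-based, and semantic) that would still have to be applied to it.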
Keywords/Search Tags:Theory of Clause as Nucleus, Language Fragments between Punctuation, Non-clause, Automatic Identification, Syntactic Information, Semantic Knowledge, Rules