An Automatic Extraction Method for Modern Chinese Word Collocations

Posted on: 2007-10-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y J Chen  Full Text: PDF
GTID: 2208360182461570  Subject: Software engineering
Abstract/Summary:
Computational linguistics is an interdisciplinary field that combines computer science and linguistics. By constructing formal computational models to process natural language, it aims to let computers understand human language, express themselves in human language, and help people who speak different languages overcome the difficulty of communicating with one another.

The extraction of modern Chinese collocations studied in this thesis draws on linguistics, computer science, and other fields. Studying the grammar, syntax, semantics, and pragmatics of modern Chinese reveals internal regularities, and collocations are regarded as organic components of every natural human language that can be made explicit. Collocation extraction therefore has important applications in language learning and Chinese information processing.

Existing approaches obtain collocation candidates in several ways: by extracting them from frequency statistics on co-occurring words, by filtering candidates with the ratio of relative word rank (RRWR) method, by applying linguistic rules of collocation to restrict the parts of speech of candidates, or by using a statistical measure such as mutual information to extract collocations automatically from large amounts of text. Building on these methods, this thesis presents a frame-based extraction method, in which an extraction frame is set first and collocations are then extracted with a statistical model. Drawing on linguistic research, I set up an extraction model that combines statistical and linguistic knowledge.
A corresponding algorithm and software were also developed to extract the collocations, the positions of the words, and the grammatical structures automatically and simultaneously. The method was validated by extracting the collocations of "nang li" and "zhi liang", with accuracies of 84.33% and 73%, respectively. Position information for the words was obtained at the same time. All of this information is an important resource for Chinese information processing.
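
The general idea of setting a frame around a node word and then scoring the co-occurring words statistically can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: it assumes the text is already segmented into words, treats the frame as a simple window of a few positions on either side of the node word, and scores candidates with pointwise mutual information only. The function name `extract_collocations`, its parameters, and the toy corpus are all hypothetical, and RRWR filtering and part-of-speech restrictions are left out.

```python
import math
from collections import Counter, defaultdict


def extract_collocations(sentences, node, window=2, min_count=2):
    """Rank words that co-occur with `node` inside a +/- `window` frame.

    `sentences` is a list of token lists (assumed to be pre-segmented).
    Returns (word, co_count, pmi, offsets) tuples sorted by PMI, where
    `offsets` maps relative position (negative = left of node) to a count.
    """
    word_freq = Counter()           # corpus frequency of every token
    pair_freq = Counter()           # co-occurrences with the node word
    offsets = defaultdict(Counter)  # position information per candidate
    total = 0                       # total number of tokens in the corpus

    for tokens in sentences:
        word_freq.update(tokens)
        total += len(tokens)
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                pair_freq[tokens[j]] += 1
                offsets[tokens[j]][j - i] += 1

    results = []
    for word, co in pair_freq.items():
        if co < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        # PMI = log2( P(node, word) / (P(node) * P(word)) ),
        # with each probability estimated as count / total.
        pmi = math.log2(co * total / (word_freq[node] * word_freq[word]))
        results.append((word, co, pmi, dict(offsets[word])))

    results.sort(key=lambda r: r[2], reverse=True)
    return results


# Toy, pre-segmented corpus written in pinyin (illustrative data only).
corpus = [
    ["tigao", "chanpin", "zhiliang"],
    ["chanpin", "zhiliang", "hen", "hao"],
    ["tigao", "zhiliang"],
    ["ta", "hen", "hao"],
]

for word, co, pmi, pos in extract_collocations(corpus, "zhiliang"):
    print(word, co, round(pmi, 2), pos)
```

Recording the relative offsets alongside the score is one simple way to keep position information for each candidate. A fuller system would also need the linguistic filtering that the abstract describes.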
Keywords/Search Tags: Chinese collocation, ratio of relative word rank (RRWR), mutual information, collocation frame