Automatic Identification of Coordination with Overt Conjunctions

Posted on: 2009-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: D B Wang
Full Text: PDF
GTID: 2205360245976586
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
The automatic identification of coordination with overt conjunctions (COC) lays groundwork for building a Chinese treebank, improves parser efficiency, and is useful for machine translation and information extraction. Most previous research on COC was theoretical, and the few studies on automatic identification dealt only with simple COC. This thesis identifies COC with both rule-based and statistical methods, based on a large corpus.
The thesis first analyses the internal and external linguistic features of COC using statistical data. The internal features are the distribution of POS sequences, of phrase sequences, of mixed POS and phrase sequences, of length, and of conjunctions. The external features are the distribution of syntactic functions and the features of boundary words. These statistics both supply the linguistic knowledge needed to identify COC and provide accurate data for investigating COC itself.
Drawing on the structural parallelism of COC and the similarity of head words, the thesis then identifies COC with a rule-based method. The results are poor because the similarity calculation is inaccurate and the POS rule templates are too simple. For COC with a single structure and conjunction, the best F-score reaches 62.52% in the closed test and 57.12% in the open test, which indicates that a rule-based method relying on structural parallelism and head-word similarity does not perform well.
Finally, the thesis introduces the background, basic principles, and applications of CRF (conditional random fields) in Chinese information processing and uses it to identify COC. Experiments were run separately on corpora containing nested COC, non-nested COC, and longest COC, using both a complex-feature model and a model with linguistic features.
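To make the sequence-labelling setting concrete, the following is a minimal sketch of how COC spans could be encoded as per-token labels of the kind a CRF is typically trained on. The BIO scheme, the `COC` label name, the function names, and the English example sentence are all illustrative assumptions; the abstract does not state which label scheme or features the thesis used.

```python
# Hypothetical BIO encoding of coordination spans for a sequence labeller.
# The label scheme and all names here are assumptions, not the thesis's setup.

def spans_to_bio(n_tokens, spans):
    """Convert (start, end) token spans (end exclusive) into BIO labels."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: {(start, end)}")
        labels[start] = "B-COC"
        for i in range(start + 1, end):
            labels[i] = "I-COC"
    return labels


def bio_to_spans(labels):
    """Recover (start, end) spans from a BIO label sequence."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B-COC":
            if start is not None:      # a new span begins right after another
                spans.append((start, i))
            start = i
        elif label == "I-COC":
            if start is None:          # tolerate an I- label without a B-
                start = i
        else:                          # "O" closes any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the sentence
        spans.append((start, len(labels)))
    return spans


tokens = ["we", "study", "syntax", "and", "semantics", "today"]
labels = spans_to_bio(len(tokens), [(2, 5)])   # "syntax and semantics"
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A flat scheme like this can only mark one level of structure, so nested COC would need a different encoding. This may be one reason to evaluate nested, non-nested, and longest COC separately, though that is an inference rather than something the abstract states.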
The best F-scores in the closed and open tests reach 99.17% and 88.21%, 99.99% and 87.85%, and 99.98% and 84.42% on the three corpora, respectively. This indicates that the statistical method outperforms the rule-based method in both efficiency and precision of identification.
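For reference, an F-score combines precision and recall into a single figure. The sketch below assumes the balanced F1 measure and exact-match scoring of spans; the abstract does not say which variant or matching criterion the thesis used, and the function name and span data are invented for illustration.

```python
# Minimal sketch of span-level precision, recall and balanced F-score.
# Exact-match scoring and F1 are assumptions; the example spans are made up.

def precision_recall_f1(gold_spans, predicted_spans):
    """Score predicted spans against gold spans by exact (start, end) match."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(2, 5), (8, 11), (14, 16)]
predicted = [(2, 5), (8, 10)]          # one exact match, one boundary error
p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")
```

Under exact matching, a span with one wrong boundary counts as both a false positive and a false negative, which is one reason open-test scores can fall well below closed-test scores.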
Keywords/Search Tags: Coordination with Overt Conjunctions, Semantic Similarity, Rule Template, CRF, Feature Model