
A Study On Chinese Organization Names Based On Dynamic Circulating Corpus

Posted on: 2009-02-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Chen
GTID: 1115360302473188
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Chinese organization names are the specific forms of address for Chinese organizations. We have carried out in-depth and thorough research on Chinese organization names over a large-scale corpus, from both macroscopic and microscopic perspectives and along both the diachronic and synchronic axes. The aim of this dissertation is to provide language resources and valid rules for Chinese information processing, a standard for forming Chinese organization names, a reference for management and registration in the field, and an outlook for language resource monitoring. Its main contributions are as follows.

First, we define the connotation and extension of Chinese organization names, distinguish them from non-Chinese organization names, and put forward a classification system based on headwords.

Second, based on the DCC corpus, we establish a corpus of Chinese organization names and a corresponding resource database. The material is drawn from six mainstream print newspapers from 2002 to 2006 and comprises 1,360,416 texts, 8,750,105 types, 247,257,749 tokens, and 16 billion bytes. The resource database consists of two main databases and five sub-databases.

The two main databases are:
① "Original information of Chinese organization names corpus": 3,954,716 organization names with their POS tags (org, aorg) and many attributes, such as text field, time, and context attributes.
② "General table of Chinese organization names": 615,681 types of Chinese organization names, each marked with its headword and secondary POS, character length, word length, count, frequency, cumulative frequency, and dispersion across texts, newspapers, and years.

The five sub-databases are:
① "Characters of Chinese organization names corpus": records 5,241 tokens and 23,130,786 characters in the general table of Chinese organization names.
② "Words of Chinese organization names corpus": records 70,110 tokens in 36 categories and 2,352,589 words from the top 600,000 Chinese organization names in the general table.
③ "Forbidden words thesaurus of Chinese organization names": records 11 types of forbidden POS, six types of forbidden characters, and forbidden words in three kinds of substantives.
④ "Commonly used Chinese organization names corpus": contains 15,970 correct Chinese organization names, which together reach a cumulative frequency of 70%.
⑤ "Table of abbreviations and full names of Chinese organizations": contains 3,000 pairs taken from the general table of Chinese organization names.

Third, we study the distribution of Chinese organization names in terms of frequency, character length, domain, time span, and newspaper. Domain characteristics of Chinese organization names are proposed, and their significance for text classification and for common words is analyzed.

Fourth, we study the structure, composition, abbreviation, and context of Chinese organization names. Two models of Chinese organization names are put forward. Four categories of structural components are studied in terms of form, properties, and contextual regularities. Nine principles of abbreviation for Chinese organization names are established, and three kinds of collocation are identified. The value of the abbreviation model and of the collocations for disambiguation and shallow parsing is pointed out. With the help of these rules, an identification scheme for Chinese organization names is implemented. The results show that the forbidden POS method automatically filters out 85,700 types of mistakes, or 13.92% of the total data; the forbidden words method automatically filters out 44,307 types of identification results, or 7.2% of the total data; and the non-headword method automatically filters out 11,711 types of identification results, or 1.9% of the total data.

Fifth, the method and value of automatically monitoring Chinese organization names are pointed out, and an experiment on annual organization names is carried out.

Sixth, concrete standardization suggestions are put forward after a detailed analysis of Chinese organization names, nonstandard usage, and newly emerging problems.

Future work will build a more complete resource database, devise an applicable semantic thesaurus, and carry out deeper research on forbidden words, the classification system, inner structure, and related topics.
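The three filtering methods reported in the abstract (forbidden POS, forbidden words, non-headwords) can be pictured as a cascade of rule checks over candidate names. The following is only a minimal sketch under stated assumptions, not the dissertation's implementation: the rule sets, the `Candidate` type, the function names, and the idea that the headword is the final word of a name are all hypothetical placeholders. The only figures taken from the abstract are the filtered counts and the 615,681 types in the general table, against which the reported percentages are checked.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule data for illustration only; the dissertation's actual
# forbidden-POS list, forbidden-word thesaurus, and headword list are not
# given in the abstract.
FORBIDDEN_POS = {"v", "d"}
FORBIDDEN_WORDS = {"的"}
HEADWORDS = {"公司", "大学", "委员会"}


@dataclass
class Candidate:
    """A segmented, POS-tagged candidate organization name."""
    words: List[str]
    pos: List[str]


def rejection_reason(c: Candidate) -> Optional[str]:
    """Return the first rule that rejects the candidate, or None if it passes."""
    if any(tag in FORBIDDEN_POS for tag in c.pos):
        return "forbidden_pos"
    if any(word in FORBIDDEN_WORDS for word in c.words):
        return "forbidden_word"
    # Assumption: the headword is the last word of the name.
    if not c.words or c.words[-1] not in HEADWORDS:
        return "non_headword"
    return None


def share(filtered: int, total: int) -> float:
    """Percentage of `total` removed by a filter, rounded to two decimals."""
    return round(100 * filtered / total, 2)


# Counts reported in the abstract, checked against the 615,681 types
# in the general table of Chinese organization names.
TOTAL_TYPES = 615_681
print(share(85_700, TOTAL_TYPES))   # forbidden POS method
print(share(44_307, TOTAL_TYPES))   # forbidden words method
print(share(11_711, TOTAL_TYPES))   # non-headword method
```

Applying the checks in a fixed order means each rejected candidate is attributed to exactly one rule, which is one way per-method counts like those above could be kept disjoint; whether the dissertation counts them this way is not stated in the abstract.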
Keywords: Chinese organization names, DCC corpus, resource database of Chinese organization names, named entity identification, national language resource monitoring, language standardization