Research On Methodology For Extraction Of Fixed Phrases In Education Field

Posted on: 2010-10-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Liu
GTID: 1115360302972993
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Words and phrases are the concrete building materials of a language, and phrases are larger units than words. Compared with research on words, however, research on phrases is very limited. Linguists are currently interested in fixed phrases that are not covered by idioms and proverbs. The information processing industry likewise needs fixed phrases beyond idioms and proverbs, especially those that are "integrated closely and used steadily". The word segmentation units defined for information processing include both words and phrases that are "integrated closely and used steadily". Unfortunately, there is no effective method for capturing such phrases. Although "integrated closely and used steadily" is the stated principle for word segmentation units, no operational standard has been applied in practice, so machines cannot work according to the principle.

To solve this problem, this paper aims to find a method that automatically extracts phrases that are integrated closely and used steadily. Based on the DCC (Dynamic Circulating Corpus) of the Plain Media Branch of the National Language Resource Monitor Center, this paper selects Education-field texts from 15 mainstream newspapers published from 2006 to 2008, amounting to 142,069 texts and 216,154,807 bytes. The definition of fixed phrases that are "integrated closely and used steadily" is put forward first. Using statistical and rule-based methods, this paper then judges 24,116,507 candidate strings from the perspectives of frequency, mutual information, entropy, and syntax, and evaluates whether each string that survives these steps is semantically legitimate. Finally, after diachronic analysis, 660 fixed phrases that are "integrated closely and used steadily" are extracted. In this way, this paper puts forward a basic method for implementing the principle of close integration and steady usage, and provides an approach for investigating whether a phrase is "integrated closely and used steadily".

Main contents of this paper:
◇ Extraction of high-frequency seed words and determination of the length of candidate strings. Extracting high-frequency seed words requires text pretreatment, which converts HTML texts into plain text and classifies them by field, followed by word segmentation, which divides each sentence into words. The segmentation software was provided by Mr. Zhao Jun, an associate research fellow at the Institute of Automation, Chinese Academy of Sciences. The pretreated texts amount to 2,500,169 texts and 3,614,364,074 bytes. The segmented texts in the fields of education, economy, entertainment, and sports amount to 921,529 texts and 1,213,283,890 bytes. Using the method of ID comparison, 5000 words in the field of education are extracted as high-frequency seed words. Experiments show that the length of candidate strings ranges from 2 to 5. With the high-frequency seed words of the education field, 24,116,507 candidate strings are extracted.
◇ Filtering of candidate strings with statistical parameters such as frequency, mutual information, and entropy. In this step, 16,896 strings are extracted from the 24,116,507 items over the three years.
◇ Filtering of candidate strings with syntax rules. Five classes of colligations are set: "a+n, n+v, n+n, v+n, v+v". After identical candidates are merged, 785 of the 16,896 items are kept according to the syntax rules.
◇ Semantic testing of the candidate items based on HowNet. The result after this step is 785 items.
◇ Diachronic testing to evaluate whether a candidate string is used steadily. This paper observes the 785 candidate strings, and 660 of them are selected as the final fixed phrases.

The innovations and main work of this paper are as follows:
◇ Fixed phrases that are "integrated closely and used steadily" are defined. Based on the DCC, texts in the education field are processed to extract strings of 2 units. The candidate strings are then filtered by statistical parameters, syntax rules, semantic rules, and diachronic observation, and 660 fixed phrases are finally extracted.
◇ The principle of "close integration and steady usage" is specified in detail. In the multi-character framework, close integration is tested by statistical parameters, syntax rules, and semantics; steady usage is tested diachronically.
◇ An extraction method is put forward that evaluates candidate strings from quantity to quality.
◇ The extraction method can be adopted in other fields. Moreover, it can help add more segmentation units that are "integrated closely and used steadily", help determine phrases in the language monitoring field, and offer materials for Chinese linguistics and lexical dictionaries. Hence, the method and the fixed phrases it extracts can be useful in many fields.
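
The filtering steps described above (frequency, mutual information, entropy, syntax rules, diachronic test) can be illustrated with a small Python sketch. It is not the dissertation's implementation. The abstract gives no formulas or thresholds, so the sketch assumes pointwise mutual information for "mutual information", left and right neighbour entropy for "entropy", and arbitrary cut-off values. It also assumes that "used steadily" means surviving the filters in every year. The function names, the toy corpus, and the English tokens are hypothetical; only the five colligations come from the abstract.

```python
import math
from collections import Counter, defaultdict

# The five colligations named in the abstract (a = adjective, n = noun, v = verb).
ALLOWED_PATTERNS = {("a", "n"), ("n", "v"), ("n", "n"), ("v", "n"), ("v", "v")}


def entropy(counter):
    """Shannon entropy in bits of the distribution stored in a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def score_bigrams(sentences):
    """Return {(w1, w2): (freq, pmi, left_entropy, right_entropy)}.

    `sentences` is a list of sentences, each a list of (word, pos) pairs.
    """
    unigrams, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        words = [w for w, _ in sent]
        unigrams.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bigrams[pair] += 1
            # Record neighbours; a sentence boundary counts as a context too.
            left[pair][words[i - 1] if i > 0 else "<s>"] += 1
            right[pair][words[i + 2] if i + 2 < len(words) else "</s>"] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), freq in bigrams.items():
        expected = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log2((freq / n_bi) / expected)
        scores[(w1, w2)] = (freq, pmi, entropy(left[(w1, w2)]), entropy(right[(w1, w2)]))
    return scores


def filter_candidates(sentences, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep two-unit strings that pass the statistical and syntactic filters.

    The thresholds are illustrative assumptions, not values from the source.
    """
    # Simplification: a pair takes the POS tags of its first occurrence.
    pos_of = {}
    for sent in sentences:
        for (w1, p1), (w2, p2) in zip(sent, sent[1:]):
            pos_of.setdefault((w1, w2), (p1, p2))
    kept = []
    for pair, (freq, pmi, left_h, right_h) in score_bigrams(sentences).items():
        statistical_ok = freq >= min_freq and pmi >= min_pmi and min(left_h, right_h) >= min_entropy
        if statistical_ok and pos_of[pair] in ALLOWED_PATTERNS:
            kept.append(pair)
    return sorted(kept)


def steady(candidates_by_year):
    """Diachronic test (assumed form): keep candidates found in every year."""
    return set.intersection(*(set(c) for c in candidates_by_year.values()))


# Hypothetical toy corpus of segmented, POS-tagged sentences.
toy = [
    [("schools", "n"), ("promote", "v"), ("quality", "n"), ("education", "n")],
    [("teachers", "n"), ("discuss", "v"), ("quality", "n"), ("education", "n"), ("reform", "n")],
    [("parents", "n"), ("support", "v"), ("quality", "n"), ("education", "n"), ("today", "t")],
    [("the", "u"), ("reform", "n"), ("continues", "v")],
]
print(filter_candidates(toy))
```

High mutual information stands in for "integrated closely": the two units co-occur more often than chance predicts. High neighbour entropy indicates that the string appears in varied contexts and is therefore likely to be a complete unit rather than a fragment of a longer one. The semantic test against HowNet is left out because it depends on an external resource.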
Keywords/Search Tags: close integration, steady usage, Dynamic Circulating Corpus, special field, entropy, diachronic observation, syntax rules, semantic