Study On Extension Of Unknown Words Based On Cyber Source

Posted on:2013-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:W W Guo

Full Text:PDF

GTID:2248330371999430

Subject:Computer application technology

Abstract/Summary:

Chinese text is composed of words, and word segmentation is the most basic task in Chinese information processing. As the Internet spreads and develops, words that are not included in the dictionary, known as unknown words, keep emerging in daily life. Extracting unknown words from a corpus enriches the lexicon and also improves a computer's ability to recognize Chinese. Unknown words follow no fixed pattern and there is no uniform standard for defining them, so identifying them accurately is both a key research topic in Chinese word segmentation and a difficult problem to solve. This thesis uses cyber sources (network resources) to recursively expand candidate unknown words and screen them again, obtaining unknown words of unrestricted length whose meaning is more complete and improving the recognition of unknown words in Chinese word segmentation. The main research contents are as follows:

(1) Systematically introduces the research background, research significance, and current state of Chinese word segmentation research in China and abroad, and briefly discusses several representative Chinese and foreign word segmentation systems.

(2) Describes in detail several algorithms frequently used in Chinese word segmentation, including string-based, statistics-based, and rule-based algorithms, and analyzes the process of each. It also introduces several evaluation standards for Chinese word segmentation systems and the difficulties the field faces.

(3) Unknown word recognition algorithms. Covers several commonly used approaches to unknown word recognition: statistics-based algorithms, rule-based algorithms, and algorithms that combine statistics and rules. The emphasis is on the combined statistics-and-rule algorithm: its process is analyzed, a statistical model and a rule model are given, and experiments are carried out on a relevant corpus and the results analyzed.

(4) Expansion of unknown words. Introduces the information used for expanding unknown words. Based on two measures, mutual information and the log-likelihood ratio, an extraction algorithm that combines them is proposed; the value of the combined formula is calculated and used to screen the two-variable candidate unknown words. Cyber sources are then used to determine the most frequent left neighbor ratio and the most frequent right neighbor ratio of each two-variable candidate, and these values serve as the baseline. The candidate seeds that pass the filter are then recursively expanded using the network, yielding unknown words that are unrestricted in length, revised, and semantically more complete. Compared with the traditional algorithm, this algorithm improves the efficiency of unknown word recognition.
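The abstract does not give the combined formula, the thresholds, or the way web text is collected, so the following is only a minimal sketch of the ingredients it names, not the thesis's actual algorithm. It shows pointwise mutual information and Dunning's log-likelihood ratio for a two-element candidate, the most frequent left/right neighbor ratio, and a recursive expansion of a seed. The function names, the `BOUNDARY` punctuation set, the `threshold` value of 0.8, and the use of a plain string as a stand-in for text gathered from the network are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: punctuation and whitespace end a word and are never valid neighbors.
BOUNDARY = set("。，！？、；： \n")

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information (in bits) of the pair (x, y).
    c_xy: pair count, c_x / c_y: counts of x first / y second, n: total pairs."""
    return math.log2(c_xy * n / (c_x * c_y))

def llr(c_xy, c_x, c_y, n):
    """Dunning's log-likelihood ratio (G^2) over the 2x2 contingency table."""
    cells = [
        (c_xy, c_x, c_y),                          # x followed by y
        (c_x - c_xy, c_x, n - c_y),                # x followed by not-y
        (c_y - c_xy, n - c_x, c_y),                # not-x followed by y
        (n - c_x - c_y + c_xy, n - c_x, n - c_y),  # neither
    ]
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # an empty cell contributes nothing
            g2 += observed * math.log(observed * n / (row_total * col_total))
    return 2.0 * g2

def most_frequent_neighbor(text, cand, side):
    """Return (character, ratio) for the most frequent left or right neighbor
    of cand, where ratio = neighbor count / number of occurrences of cand."""
    neighbors, total = Counter(), 0
    start = text.find(cand)
    while start != -1:
        total += 1
        idx = start - 1 if side == "left" else start + len(cand)
        if 0 <= idx < len(text) and text[idx] not in BOUNDARY:
            neighbors[text[idx]] += 1
        start = text.find(cand, start + 1)
    if not neighbors:
        return "", 0.0
    char, count = neighbors.most_common(1)[0]
    return char, count / total

def expand(text, cand, threshold=0.8):
    """Recursively grow cand by its dominant neighbor until neither side
    reaches the (assumed) threshold. Each call lengthens cand by one character."""
    left, l_ratio = most_frequent_neighbor(text, cand, "left")
    right, r_ratio = most_frequent_neighbor(text, cand, "right")
    best = max(l_ratio, r_ratio)
    if best == 0.0 or best < threshold:
        return cand
    grown = left + cand if l_ratio >= r_ratio else cand + right
    return expand(text, grown, threshold)

# Toy stand-in for text collected from the network.
web_text = "我用新浪微博。他用新浪微博。她刷新浪微博。"
print(expand(web_text, "微博"))
```

In this toy text the seed 微博 is always preceded by 浪, and 浪微博 is always preceded by 新, so the seed grows leftward to 新浪微博. Expansion stops there because the most frequent left neighbor, 用, covers only two of the three occurrences, which is below the assumed threshold. How the thesis sets its baseline from the neighbor ratios is not stated in the abstract, so the fixed threshold here is a placeholder.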

Keywords/Search Tags:

Chinese word segmentation, unknown word, identification algorithm, cyber source, expansion

Related items

1	Research And Implementation Of Chinese Word Segmentation Algorithm
2	Research Of Combined Chinese Word Segmentation Method
3	Comparative Research On Open-Source Chinese Word Segmentation Machines
4	The Research Of Unknown Chinese Word Recognition And Its Application To Chinese Input Method
5	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
6	Statistical Learning In Chinese Word Segmentation And Application-specific Segmentation
7	The Research Of Chinese Word Segmentation Algorithm Based On Dictionary And Probability Statistics
8	The Design And Implementation Of Chinese Word Segmentation System
9	The Study Of Maximum-Match-Based Written Chinese Automatic Segmentation
10	Based On Dictionary And Word Frequency Analysis Of The Unknown Words From The Bbs Of Corpus Recognition Research