
Research On Distributed Representation Learning Of Chinese Word

Posted on: 2015-12-07    Degree: Master    Type: Thesis
Country: China    Candidate: X Q Hou    Full Text: PDF
GTID: 2295330461983862    Subject: Probability theory and mathematical statistics
Abstract/Summary:
Word representation is one of the key issues in natural language processing. It is an important prerequisite for building syntactic and semantic analysis models, and it also affects the accuracy and robustness of many NLP application systems, including information retrieval and question-answering systems. Furthermore, when processing large-scale real-world Chinese datasets, the word representation method plays a key role in the efficiency and performance of a system.

This thesis focuses on three word representation strategies: one-hot representation, distributional representation based on latent semantic information, and distributed representation based on a neural language model. The one-hot representation is the most widely used in Chinese information processing; many Chinese chunking systems based on maximum entropy models and conditional random fields use word-related one-hot vectors as features. This representation is simple but high-dimensional, and the corresponding feature matrix is very sparse. To make up for this shortcoming, the latter two strategies map words into low-dimensional real-valued vectors. The difference between them is that the distributional representation mainly employs matrix decomposition techniques, while the distributed representation regards the word vectors as a hidden layer in a neural network.

We concentrate on the neural-network-based strategy, specifically the neural language model proposed by Bengio (2003), and conduct numerical simulations on a large Chinese corpus of 5 million characters, annotated manually by Shanxi University. The simulations show that the maximum and minimum elements of the distributed word-representation matrix grow larger and smaller, respectively, as the number of iterations increases. This phenomenon is in line with the results reported in Turian (2010). Furthermore, we analyze the phenomenon theoretically and give a sufficient condition under which the matrix is unbounded.

We also study the connection between the distributed word representation and word meaning. By drawing element histograms of the vectors of typical English and Chinese polysemous words, we preliminarily observe that the more ambiguous a word is, the more peaks its histogram has, and that Chinese and English words show a similar trend.

To compare the distributional and distributed representations, we conduct word-clustering experiments with both. The experiments show that the distributed word representation finds more precise ten nearest neighbors for a word than the distributional word representation.

Finally, the one-hot and distributed representations are compared on the boundary identification task of Chinese Base Chunks. With sliding-window word features of size [-2, 2], the F value is 38.72%; replacing the one-hot word features with distributed word features raises it to 70.51%, and scaling the distributed word features raises it further to 70.74%. After adding part-of-speech features over the same [-2, 2] window, the F values are 82.35% with one-hot word features and 85.90% with distributed word features. These results indicate that the distributed representation of Chinese words benefits the boundary identification of Chinese Base Chunks.
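The following is a minimal sketch of a Bengio-style feed-forward neural language model that treats the word-vector matrix as a shared hidden layer and tracks its extreme elements across iterations, the quantity whose growth the abstract discusses. The toy corpus, vocabulary size, dimensions, and learning rate are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np

# Minimal Bengio-style NNLM sketch: n-gram context -> tanh hidden layer -> softmax
# over the vocabulary. All hyperparameters and the toy corpus are hypothetical.
np.random.seed(0)

corpus = [0, 1, 2, 3, 1, 2, 4, 0, 3, 2, 1, 4]   # toy word-id sequence
V, m, h, n = 5, 8, 16, 3                        # vocab size, embedding dim, hidden dim, context length
C = 0.01 * np.random.randn(V, m)                # word-vector matrix (the representation of interest)
H = 0.01 * np.random.randn(n * m, h)
U = 0.01 * np.random.randn(h, V)
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(50):
    for t in range(n, len(corpus)):
        ctx, target = corpus[t - n:t], corpus[t]
        x = C[ctx].reshape(-1)             # concatenate the context word vectors
        a = np.tanh(x @ H)                 # hidden layer
        p = softmax(a @ U)                 # predicted distribution over next word
        dy = p.copy(); dy[target] -= 1.0   # gradient of cross-entropy loss
        dU = np.outer(a, dy)
        da = (U @ dy) * (1 - a ** 2)
        dH = np.outer(x, da)
        dx = H @ da
        U -= lr * dU; H -= lr * dH
        for i, w in enumerate(ctx):        # update the shared word vectors
            C[w] -= lr * dx[i * m:(i + 1) * m]
    # track the extreme elements of the word-vector matrix across iterations
    print(f"epoch {epoch:2d}  max={C.max():+.4f}  min={C.min():+.4f}")
```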
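As a sketch of the neighbor comparison between the two low-dimensional representations, the snippet below retrieves the ten nearest neighbors of a query word by cosine similarity from a word-vector matrix. The matrix and vocabulary are random placeholders; in practice the rows would come from either the distributional or the distributed representation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = [f"word_{i}" for i in range(1000)]     # hypothetical vocabulary
W = rng.standard_normal((len(vocab), 50))      # one row vector per word (placeholder)

def ten_nearest(query_idx, W, vocab):
    # cosine similarity of the query row against every row
    q = W[query_idx]
    sims = W @ q / (np.linalg.norm(W, axis=1) * np.linalg.norm(q) + 1e-12)
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != query_idx][:10]

print(ten_nearest(0, W, vocab))
```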
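The chunking experiment replaces one-hot word features in a [-2, 2] sliding window with distributed word vectors, optionally scaled. Below is a sketch of that feature construction; the `embeddings` lookup, the scaling factor, and the toy sentence are assumptions for illustration, and the one-hot variant would instead emit the window words themselves as discrete features for the CRF or maximum entropy chunker.

```python
import numpy as np

def window_features(tokens, embeddings, dim, left=2, right=2, scale=1.0):
    """Concatenate the (scaled) vectors of the words in a [-left, right] window."""
    feats = []
    for i in range(len(tokens)):
        vecs = []
        for j in range(i - left, i + right + 1):
            if 0 <= j < len(tokens):
                vecs.append(scale * embeddings.get(tokens[j], np.zeros(dim)))
            else:
                vecs.append(np.zeros(dim))   # padding outside the sentence
        feats.append(np.concatenate(vecs))
    return np.array(feats)

# toy usage with made-up 4-dimensional embeddings
emb = {"我": np.ones(4), "喜欢": 2 * np.ones(4), "自然语言": 3 * np.ones(4)}
X = window_features(["我", "喜欢", "自然语言"], emb, dim=4)
print(X.shape)   # (3, 20): five window positions times 4 dimensions per position
```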
Keywords/Search Tags: word representation, neural language model, distributed word representation, Chinese Base Chunk