
Word Embedding Revision Based On Morphological Information And Semantic Lexicons

Posted on: 2018-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: J W Liu
Full Text: PDF
GTID: 2348330518997743
Subject: Computer application technology
Abstract/Summary:
Traditional word embedding models capture only word-level semantic information: they neglect the semantic meaning of words' internal structures and cannot effectively distinguish antonyms. From these weaknesses, this paper extracts two research questions: (1) how to incorporate morphological information into the training of word embeddings so as to improve both their semantic similarity and their morphological similarity; (2) how to improve the ability of word embeddings to distinguish antonyms.

For question (1), this paper takes English as an example and incorporates the morphological information of English words, namely prefixes, suffixes, and roots, into the training of word embeddings. On this basis, it proposes two implicit morpheme-enhanced word embedding models: the Average Model (AM) and the Similarity Model (SM). These implicit models differ markedly from related work: whereas related work uses the morphemes themselves to improve the quality of word embeddings, our models use the morphemes' meanings to model the word embeddings, which lets them improve both the semantic similarity and the morphological similarity of the embeddings (a simplified sketch of this composition follows the abstract). The models are evaluated on word similarity, syntactic analogy, and N-nearest-words tasks, and the implicit models achieve the best performance on all of them. Parameter analysis indicates that, in the vector space of the implicit models, morpheme-similar words lie near one another and close to the meanings of their morphemes. In addition, compared with the performance of the basic models on a large corpus, the implicit models can supplement semantic information and achieve similar performance on a small corpus.

For question (2), this paper proposes LWET (Lexicon-based Word Embedding Tuning model), which uses the synonym and antonym relations in semantic lexicons to tune the distribution of word embeddings in vector space and thereby improve their ability to distinguish antonyms. For a target word, LWET aims to place its synonyms near it and its antonyms far from it, while the irrelevant words act as a boundary located somewhere between the antonyms and the synonyms. To reduce the computational complexity, we propose two approximation algorithms: positive sampling, which has the lower computational cost, and quasi-hierarchical softmax, which performs better. LWET is evaluated on antonym recognition, distinguishing antonyms from synonyms, and word similarity; the first two tasks test the ability of the word embeddings to detect antonyms. The results show that after LWET's tuning process the word embeddings can effectively detect contrasting meanings, and the word-similarity results show that LWET does not harm the semantic structure of the pre-trained word embeddings while tuning their distribution.
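To make the morpheme-based composition described above concrete, the following is a minimal Python sketch of an Average-Model-style representation. It is illustrative only: the segmentation table, the function name average_model_vector, and the plain unweighted mean are assumptions rather than the thesis's formulation, which trains word and morpheme vectors jointly inside the embedding model instead of averaging pre-trained vectors.

import numpy as np

# Hypothetical morpheme segmentations (prefix / root / suffix); a real
# system would obtain these from an English morphological resource.
MORPHEMES = {
    "unhappiness": ["un", "happy", "ness"],
    "rewrite": ["re", "write"],
}

def average_model_vector(word, word_vecs, morpheme_vecs):
    """Compose a representation for `word` as the mean of its own vector
    and the vectors of its morphemes' meanings (one plausible reading of
    the Average Model)."""
    parts = [word_vecs[word]]
    parts += [morpheme_vecs[m] for m in MORPHEMES.get(word, [])]
    return np.mean(parts, axis=0)

# Toy usage with random 5-dimensional vectors.
rng = np.random.default_rng(0)
word_vecs = {"unhappiness": rng.normal(size=5)}
morpheme_vecs = {m: rng.normal(size=5) for m in ("un", "happy", "ness")}
print(average_model_vector("unhappiness", word_vecs, morpheme_vecs))

Under this reading, words sharing a morpheme are pulled toward that morpheme's meaning vector, which matches the parameter-analysis observation that morpheme-similar words cluster near their morphemes' meanings.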
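The LWET tuning objective can likewise be pictured as an attract-and-repel update on a pre-trained space. The sketch below is deliberately simplified and uses assumed names (lwet_step, lr): it pulls a target word toward its synonyms and pushes it away from its antonyms, but omits the thesis's treatment of irrelevant words as a boundary as well as its positive-sampling and quasi-hierarchical-softmax approximations.

import numpy as np

def lwet_step(target, synonyms, antonyms, vecs, lr=0.01):
    """One simplified tuning step in the spirit of LWET: attract the
    target word's vector to its synonyms and repel it from its antonyms.
    The actual model additionally keeps irrelevant words between the two
    groups and approximates the objective for efficiency."""
    v = vecs[target].copy()
    for s in synonyms:
        v += lr * (vecs[s] - v)   # move toward each synonym
    for a in antonyms:
        v -= lr * (vecs[a] - v)   # move away from each antonym
    vecs[target] = v / np.linalg.norm(v)  # renormalize after the update

# Toy usage: tune "good" toward "nice"/"fine" and away from "bad".
rng = np.random.default_rng(0)
vecs = {w: rng.normal(size=5) for w in ("good", "nice", "fine", "bad")}
for _ in range(100):
    lwet_step("good", synonyms=["nice", "fine"], antonyms=["bad"], vecs=vecs)

Iterating such updates over all lexicon entries reshapes the distribution so that a word's synonyms and antonyms become separable by distance, which is what the antonym-recognition and antonym-versus-synonym experiments measure.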
Keywords/Search Tags: Natural Language Processing, Word Embedding, Morphology, Antonym, Positive Sampling, Quasi-Hierarchical Softmax