
Lexeme Extending For Distributed Representations Based On Knowledge Base

Posted on: 2018-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: J C Chen
Full Text: PDF
GTID: 2348330518995431
Subject: Information and Communication Engineering
Abstract/Summary:
In recent years, distributed representations have been widely used in natural language processing (NLP) tasks such as POS tagging, machine translation, and word sense disambiguation, where they have proven highly effective. However, word embeddings trained through unsupervised learning on large-scale text corpora struggle to capture the rich annotation information available in knowledge bases and semantic dictionaries, even though the entity relations and unique hierarchical structure of a knowledge base are clearly valuable for NLP tasks. How to integrate distributed representations with a knowledge base effectively, so as to improve word embeddings and address problems such as polysemy and lexeme extending (a lexeme here denotes a particular sense of a word), has therefore attracted intense research interest. This thesis addresses that task.

Building on the autoencoder framework of the AutoExtend model [18], we propose a new semi-supervised, hierarchical word embedding model that fuses a knowledge base with unsupervised word embeddings, improving performance on polysemy, expansion, and related tasks. Although AutoExtend and related work have achieved some success, they exploit only a limited set of knowledge-base entity relations and are valid only for in-vocabulary terms. In this thesis we therefore make full use of knowledge-base features to propose and implement two models that improve word and lexeme embeddings while preserving efficiency and parallelizability. Experiments on word similarity, word expansion, and named entity recognition show that our models perform better than existing approaches. The main contributions of this thesis are as follows:

(1) RetroExtend, a semi-supervised model based on a semantic dictionary, which improves in-vocabulary word and lexeme embeddings. The model follows the autoencoder framework of AutoExtend, analyzing the entity relations among words and lexemes in the encoding-decoding process through graph-based learning.

(2) OOVExtend, a semi-supervised model based on hierarchical structure, which resolves polysemy for out-of-vocabulary (OOV) words and extends the corresponding lexeme embeddings. The model first uses the hierarchical structure of the knowledge base to match an OOV word to its closest synsets, and then computes the OOV lexeme vectors from the synset embeddings learned by RetroExtend; a weight matrix is learned by minimizing the reconstruction loss of the decoding step that maps the matched synset vectors back to the original word vectors.

(3) Using these models, we combine knowledge bases such as WordNet and PPDB with corpora such as Wikipedia and GoogleNews to tackle word similarity, word expansion, and named entity recognition. Results are evaluated on standard datasets such as WS353 and SCWS with the AvgSim and AvgSimC metrics. Compared with existing models, ours performs better; for example, Spearman correlation on the word similarity task improves by 2%~3%. In sum, the proposed algorithms are feasible and improve performance on problems such as polysemy and lexeme extending.
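The autoencoder constraint underlying AutoExtend and RetroExtend can be illustrated with a minimal NumPy sketch. The core idea is that each word vector is reconstructed as the sum of its lexeme (sense) vectors through a membership matrix, and the lexeme vectors are chosen to drive the reconstruction loss toward zero. The vocabulary, membership matrix, and dimensionality below are hypothetical toy values, not data from the thesis, and the least-squares solve stands in for the model's actual graph-based training.

```python
import numpy as np

# Toy setup (hypothetical data): 3 words, 4 lexemes (senses), dim 5.
# Membership matrix M[i, j] = 1 if lexeme j is a sense of word i.
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(3, 5))            # pretrained word embeddings
M = np.array([[1, 1, 0, 0],                    # word 0 has lexemes 0 and 1
              [0, 0, 1, 0],                    # word 1 has lexeme 2
              [0, 0, 0, 1]], dtype=float)      # word 2 has lexeme 3

# Constraint: each word vector is the sum of its lexeme vectors.
# A least-squares fit recovers lexeme vectors that reconstruct the
# word vectors through the membership matrix.
lexeme_vecs, *_ = np.linalg.lstsq(M, word_vecs, rcond=None)

# Reconstruction loss ||M @ lexemes - words||^2, driven toward zero.
loss = float(np.sum((M @ lexeme_vecs - word_vecs) ** 2))
print(loss)
```

Because the system is underdetermined here (more lexemes than words), an exact minimum-norm solution exists and the loss reaches essentially zero; in the full model, shared lexemes and knowledge-base relations couple the equations so the trade-off is non-trivial.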
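The OOV matching step of OOVExtend can likewise be sketched. The thesis matches an OOV word to its closest synsets via the knowledge base's hierarchical structure; as a simplified stand-in for that criterion, the sketch below uses cosine similarity between a (hypothetical) OOV vector and candidate synset vectors, which is an assumption of this illustration rather than the thesis's exact procedure.

```python
import numpy as np

def nearest_synset(oov_vec, synset_vecs):
    """Return the index of the synset most similar to the OOV vector.

    Cosine similarity is used here as a simplified matching criterion;
    the model described above matches via the knowledge-base hierarchy.
    """
    norms = np.linalg.norm(synset_vecs, axis=1) * np.linalg.norm(oov_vec)
    sims = synset_vecs @ oov_vec / norms
    return int(np.argmax(sims))

# Hypothetical 2-D synset embeddings and an OOV word vector.
synsets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
best = nearest_synset(np.array([0.9, 0.1]), synsets)
print(best)  # → 0
```

Once the closest synsets are found, the matched synset vectors would be mapped back toward the word-vector space by the learned weight matrix described above.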
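For the evaluation metrics named above, AvgSim averages the cosine similarity over all pairs of sense vectors of two words (AvgSimC additionally weights each pair by the probability of the sense given the context). A minimal sketch of AvgSim, with hypothetical sense vectors:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_u, senses_v):
    # AvgSim: mean cosine similarity over all pairs of sense vectors.
    sims = [cos(u, v) for u in senses_u for v in senses_v]
    return sum(sims) / len(sims)

# Hypothetical sense vectors: word u has two senses, word v has one.
u = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
v = [np.array([1.0, 0.0])]
print(avg_sim(u, v))  # → 0.5, i.e. (1 + 0) / 2
```

The reported Spearman correlation is then computed between these model similarities and the human ratings in datasets such as WS353 and SCWS.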
Keywords/Search Tags: word embedding, knowledge base, lexeme extending