
Research On Normalization Of Microblog Text Based On Distributed Semantic Representation

Posted on: 2018-06-22  Degree: Master  Type: Thesis
Country: China  Candidate: W L Wang  Full Text: PDF
GTID: 2348330515460093  Subject: Computer technology
Abstract/Summary:
With the development of the Internet, microblogs have become an indispensable part of people's lives, and research on microblog text has become a hot topic in natural language processing (NLP). Microblog text, however, contains many informal style words (ISW). These informalities challenge traditional NLP techniques and degrade downstream NLP tasks on microblogs, so normalizing microblog text is necessary.

Our survey finds that existing text normalization techniques have made some progress but remain imperfect. First, little research addresses the normalization of Chinese text, even though Chinese word segmentation makes informal style word detection harder. Second, labeled data for text normalization is scarce, which limits supervised models. Third, normalization of Chinese text is not comprehensive and pays little attention to the phenomenon of "new use of old words". Fourth, current normalization research does not take advantage of semantics.

In view of this analysis, the main work of this thesis on the normalization of Chinese microblog text is summarized as follows:

(1) We propose a method that combines statistics and rules to find ISW in a corpus. In this thesis, ISW consist of out-of-vocabulary (OOV) words (new words) and "old words" that are already in the vocabulary. We find new words among candidate strings, using a variety of statistical measures and rules for word extension, filtering, and new word detection. Detection relies only on the statistics of the new words, without any grammar rules or semantic information.

(2) We use word embeddings to solve three problems: detecting "new use of old words", extending the set of ISW, and building a normalization dictionary. To find words that show "new use of old words", we compare, for each word, its set of semantically similar words in one semantic space with its set in the other. We also extend the set of ISW using word embedding similarity in the semantic space of microblogs; this method is more effective for words composed of letters or digits. We combine two kinds of corpus and use word embeddings to build the normalization dictionary automatically in a single semantic space. Because word embeddings make full use of semantics, normalization becomes more flexible and accurate. Words that show "new use of old words" are a good complement to the traditional normalization dictionary.

(3) We propose a method for normalizing Chinese microblog text based on a neural network language model. We train a bidirectional GRU-RNN on a corpus of standard text and combine it with pronunciation and semantic similarity features to form a text normalization model within a log-linear framework. This method makes full use of context, pronunciation, and semantics, which makes the results more accurate.

In conclusion, this thesis uses methods based on distributed representation and combines a normalization dictionary with a statistical model for normalization, improving on traditional methods.
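As a rough illustration of the neighbor-set comparison described in (2), the following Python sketch flags words whose nearest neighbors differ between two embedding spaces. It is only a sketch under stated assumptions, not the thesis's implementation: the abstract does not say which similarity measure, neighborhood size, or threshold is used, so cosine similarity, Jaccard overlap, `k`, and `threshold` are all assumptions here, as is the guess that the two spaces come from a formal corpus and a microblog corpus. The toy vectors and English placeholder words are invented for the example.

```python
import math

# Hypothetical toy embeddings. In the "formal" space, "dive" sits near water
# words; in the "microblog" space it sits near "lurk" (reading without posting).
FORMAL = {
    "dive": [1.0, 0.0], "swim": [0.9, 0.1], "sea": [0.8, 0.2],
    "lurk": [0.0, 1.0], "silent": [0.1, 0.9],
}
MICROBLOG = {
    "dive": [0.0, 1.0], "swim": [1.0, 0.0], "sea": [0.9, 0.1],
    "lurk": [0.1, 0.9], "silent": [0.2, 0.8],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(word, space, k):
    """Return the set of the k words most similar to `word` in `space`."""
    scored = [(cosine(space[word], vec), other)
              for other, vec in space.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}

def neighbor_overlap(word, space_a, space_b, k=2):
    """Jaccard overlap of the word's neighbor sets in the two spaces."""
    a = nearest_neighbors(word, space_a, k)
    b = nearest_neighbors(word, space_b, k)
    return len(a & b) / len(a | b)

def find_shifted_words(space_a, space_b, k=2, threshold=0.2):
    """Words present in both spaces whose neighbor overlap is below threshold."""
    shared = sorted(set(space_a) & set(space_b))
    return [w for w in shared
            if neighbor_overlap(w, space_a, space_b, k) < threshold]

if __name__ == "__main__":
    print(find_shifted_words(FORMAL, MICROBLOG))  # prints ['dive']
```

On this toy data only "dive" has no neighbors in common across the two spaces. With real embeddings, the two spaces would need a shared vocabulary for the neighbor sets to be comparable, and `k` and `threshold` would have to be tuned.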
Keywords/Search Tags: New word detection, Distributed semantic representation, New use of old words, Normalization dictionary, Normalization of Chinese text