Research On Chinese Word Embedding Model Integrated With Sub-character Semantics

Posted on: 2023-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: W Lu
Full Text: PDF
GTID: 2558307172457674
Subject: Computer system architecture
Abstract/Summary:
Chinese language processing technologies are applied in numerous scientific and technological fields. Converting Chinese words into embeddings helps computers understand users’ intentions and facilitates information exchange, laying the foundation for natural language processing tasks. However, current Chinese word embedding models ignore the semantic relevance inside words, which harms the quality of the resulting embeddings. Learning Chinese word embeddings that capture this internal semantic relevance is therefore a key task in Chinese natural language processing.

Chinese characters consist of sub-characters, including radicals, components and strokes. Similar to roots in English words, sub-characters indicate the origins and basic semantics of Chinese characters. Some research learns Chinese word embeddings using sub-characters but ignores the fact that some Chinese characters composed of the same sub-characters have different meanings, so the learned embeddings deviate from the semantics of the words. Furthermore, many neologisms, such as names, transliterated loanwords and internet terms, are ambiguous and hard to identify, posing challenges for learning Chinese word embeddings.

To address these issues, this thesis studies how to express the semantics of Chinese words based on their hierarchical structure, and proposes a Chinese semantic weighted graph with a weight assignment mechanism that expresses the semantic relevance among words, characters and sub-characters. On this basis, it studies how to learn Chinese word embeddings with internal semantic relevance and proposes inside CC, a Chinese word embedding model integrated with sub-character semantics. Taking the Chinese semantic weighted graph as input, inside CC reveals the semantic relations among different language components. It also incorporates the semantic information and semantic relevance of those components and learns embeddings of words, including neologisms, thereby improving the quality and semantic expressiveness of Chinese word embeddings.

To evaluate the quality of the Chinese word embeddings learned by inside CC, extensive experiments were carried out on multiple training corpora and datasets. Experimental results indicate that, compared with state-of-the-art Chinese word embedding models, inside CC achieves maximum improvements of 6.42% in semantic similarity experiments and 9.18% in analogy reasoning experiments. These results verify that inside CC outperforms its state-of-the-art counterparts.
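
To make the idea of a weighted graph over words, characters and sub-characters more concrete, the following is a minimal Python sketch. It is not the thesis's implementation: the decomposition table `CHAR_TO_SUBCHARS`, the helper names `add_edge` and `build_graph`, and the two constant weights are all hypothetical placeholders, since the abstract does not describe the actual weight assignment mechanism.

```python
from collections import defaultdict

# Hypothetical toy decomposition table (character -> sub-characters).
# A real system would load this from a character-decomposition resource.
CHAR_TO_SUBCHARS = {
    "你": ["亻", "尔"],
    "好": ["女", "子"],
    "明": ["日", "月"],
}


def add_edge(graph, a, b, weight):
    """Store an undirected weighted edge between nodes a and b."""
    graph[a][b] = weight
    graph[b][a] = weight


def build_graph(words, word_char_weight=1.0, char_sub_weight=0.5):
    """Build a three-level graph: word -> character -> sub-character.

    Nodes are (kind, text) tuples so that a string used both as a
    character and as a sub-character maps to two distinct nodes.
    The constant weights are placeholders (an assumption), not the
    weight assignment mechanism proposed in the thesis.
    """
    graph = defaultdict(dict)
    for word in words:
        word_node = ("word", word)
        for ch in word:
            char_node = ("char", ch)
            add_edge(graph, word_node, char_node, word_char_weight)
            # Characters missing from the table get no sub-character edges.
            for sub in CHAR_TO_SUBCHARS.get(ch, []):
                add_edge(graph, char_node, ("sub", sub), char_sub_weight)
    return graph


if __name__ == "__main__":
    g = build_graph(["你好", "明天"])
    for node, neighbors in g.items():
        print(node, neighbors)
```

In a sketch like this, characters that share sub-characters become connected through the shared sub-character node, and the edge weights are where a model could down-weight sub-characters that do not reflect a word's meaning, for example in transliterated loanwords.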
Keywords/Search Tags: Natural language processing, Chinese word embedding, Chinese sub-character, Semantic relevance