Research On Chinese Word Embedding Model Integrated With Sub-character Semantics

Posted on:2023-07-16

Degree:Master

Type:Thesis

Country:China

Candidate:W Lu

Full Text:PDF

GTID:2558307172457674

Subject:Computer system architecture

Abstract/Summary:

Chinese language processing technologies are applied in numerous scientific and technological fields. Converting Chinese words into embeddings helps computers understand users' intentions and facilitates information exchange, laying the foundation for natural language processing tasks. However, current Chinese word embedding models ignore the semantic relevance inside words, which degrades the quality of the resulting embeddings. Learning Chinese word embeddings that capture this internal semantic relevance is therefore a key task in Chinese natural language processing.

Chinese characters consist of sub-characters, including radicals, components, strokes, etc. Similar to roots in English words, sub-characters indicate the origins and basic semantics of Chinese characters. Some research attempts to learn Chinese word embeddings using sub-characters, but overlooks the fact that some Chinese characters consisting of the same sub-characters have different meanings, which causes the learned embeddings to deviate from the semantics of the words. Furthermore, many neologisms, such as names, transliterated loanwords, and internet terms, are ambiguous and hard to identify, posing challenges for learning Chinese word embeddings.

To address these issues, this thesis studies how to express the semantics of Chinese words based on their hierarchical structure, and proposes a Chinese semantic weighted graph with a weight assignment mechanism to express the semantic relevance among words, characters, and sub-characters. On this basis, it studies how to learn Chinese word embeddings with internal semantic relevance and proposes inside CC, a Chinese word embedding model integrated with sub-character semantics. With the Chinese semantic weighted graph as input, inside CC can reveal the semantic relations among different language components. It can also incorporate the semantic information and semantic relevance of different language components and learn embeddings of words, including neologisms, thereby improving the quality and semantic expressiveness of Chinese word embeddings.

To evaluate the quality of the Chinese word embeddings learned by inside CC, extensive experiments were carried out on multiple training corpora and datasets. The results indicate that, compared with state-of-the-art Chinese word embedding models, inside CC achieves a maximum improvement of 6.42% in semantic similarity experiments and 9.18% in analogy reasoning experiments. These results indicate that inside CC outperforms its state-of-the-art counterparts.
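The word, character, and sub-character hierarchy described in the abstract can be pictured as a small weighted graph. The following Python sketch is only an illustration under stated assumptions: the class name `SemanticWeightedGraph`, the method names, and the fixed edge weights (1.0 for word-character edges, 0.5 for character-sub-character edges) are hypothetical, and the abstract does not specify the thesis's actual weight assignment mechanism. The decompositions used are for demonstration only.

```python
from collections import defaultdict


class SemanticWeightedGraph:
    """Hypothetical undirected weighted graph over words, characters, and sub-characters."""

    def __init__(self):
        # Adjacency map: node -> {neighbor: weight}.
        # Nodes are (level, text) tuples so that a symbol such as "日" can
        # appear both as a character and as a sub-character without clashing.
        self.adj = defaultdict(dict)

    def add_edge(self, u, v, weight):
        # Store the edge in both directions (undirected graph).
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def add_word(self, word, decomposition, char_weight=1.0, sub_weight=0.5):
        """Link a word to its characters, and each character to its sub-characters.

        `decomposition` maps a character to a list of sub-characters.
        The default weights are placeholders, not values from the thesis.
        """
        word_node = ("word", word)
        for ch in word:
            char_node = ("char", ch)
            self.add_edge(word_node, char_node, char_weight)
            for sub in decomposition.get(ch, []):
                self.add_edge(char_node, ("sub", sub), sub_weight)

    def neighbors(self, node):
        # Return a copy so callers cannot mutate the graph by accident.
        return dict(self.adj.get(node, {}))


# Illustrative decompositions (assumed for this example only).
decomposition = {"明": ["日", "月"], "天": ["一", "大"]}

graph = SemanticWeightedGraph()
graph.add_word("明天", decomposition)  # "tomorrow"
graph.add_word("明日", decomposition)  # shares the character 明 with 明天

print(graph.neighbors(("char", "明")))
```

In this sketch the two words become connected through their shared character node, and characters become connected through shared sub-character nodes. A graph-based embedding method could use such paths as context, although how inside CC actually consumes the graph is not stated in the abstract.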

Keywords/Search Tags:

Natural language processing, Chinese word embedding, Chinese sub-character, Semantic relevance

Related items

1	A Representation Method Of Chinese Characters And Words Based On Word-Character Alignment
2	Research On Multi-granularity Chinese Word Embedding Based On Glyph Structure
3	Research On Chinese Word Segmentation Integrating Pinyin And Tone Information
4	Research On Chinese-Oriented Hybrid Embedding Text Representation Method
5	Research On Chinese Word Sense Disambiguation Based On Semantic Analysis
6	Research On Chinese Word Segmentation Based On Deep Learning
7	The Methodology And Implementation Of Chinese Natural Language Query In Databases
8	Visual Analysis System Of Chinese Natural Language Processing Model
9	Study On Chinese Word Segmentation Based On Recurrent Neural Network Language Model
10	OCR Error Post-correction Based On Chinese Character-level Features And Language Model