A Similarity-based Tibetan Word Co-occurrence Network Construction Technology And Feature Analysis

Posted on:2022-02-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y D Z Jia

GTID:2515306482473434

Subject:Chinese Ethnic Language and Literature

Abstract/Summary:

Language and writing are products of the long-term evolution of human beings and the continuous development of civilization, and together they form a complex network system that has developed over a long period. Even with widely used linguistic methods, it is difficult to uncover the relationships and overall characteristics of a language's internal network. Language co-occurrence networks apply complex network techniques to human language and can further reveal the internal structural relationships of a language and its script. Scholars in many countries have done extensive work on co-occurrence networks for English and Chinese, and the results have been applied to a variety of language and text processing tasks. By comparison, construction and feature analysis techniques for Tibetan co-occurrence networks are still in their infancy. Research in this area can reveal the internal structure of the Tibetan language and script, and it has broad application value in Tibetan information processing.

By analyzing the structure of the building blocks of similarity-based co-occurrence networks, this thesis proposes a similarity-based method for constructing a Tibetan word co-occurrence network. The method uses words as network nodes and builds the network by connecting similar words with edges. We then analyze the network characteristics of the Tibetan word co-occurrence network constructed in this way. The main contents include:

(1) Establishing an experimental corpus. There is currently no unified experimental corpus for constructing Tibetan word co-occurrence networks. We therefore collected an 18.07 M Tibetan text corpus containing 1258980 entries from Tibetan websites and electronic documents. After preprocessing, we obtained a high-quality experimental corpus for constructing the Tibetan word co-occurrence network.

(2) Tibetan word vector representation and similarity calculation. As deep learning has advanced, word embeddings have become a systematic way to represent the semantic relationships between words. Constructing a similarity-based co-occurrence network of Tibetan words requires representing Tibetan words as word vectors and calculating the similarity between words. Based on the Tibetan sub-word corpus we constructed, the thesis trains word vectors with the CBOW model, which trains better on small corpora, and uses cosine similarity to calculate the similarity between words.

(3) Construction method and feature analysis of the Tibetan word co-occurrence network. Building on an analysis of co-occurrence network construction techniques for languages such as English and Chinese, we construct a word co-occurrence network with the proposed method, experimentally verify the validity of the similarity-based construction method, and analyze the statistical characteristics of the resulting Tibetan word co-occurrence network.
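
Steps (2) and (3) of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how similar word pairs are selected, so the fixed similarity threshold used here is an assumption, and the word labels (w1 to w4), the two-dimensional toy vectors, and the function names are all hypothetical. In practice the vectors would come from a trained CBOW model rather than being written by hand.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_network(vectors, threshold):
    """Words are nodes; an undirected edge joins two words whose
    cosine similarity is at least `threshold` (assumed edge rule)."""
    graph = {word: set() for word in vectors}
    for a, b in combinations(vectors, 2):
        if cosine_similarity(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of neighbors per node."""
    return sum(len(neighbors) for neighbors in graph.values()) / len(graph)


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))


# Hypothetical toy embeddings standing in for trained Tibetan word vectors.
toy_vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.8, 0.2],
    "w4": [0.0, 1.0],
}

network = build_similarity_network(toy_vectors, threshold=0.9)
print(network)
print("average degree:", average_degree(network))
print("clustering of w1:", clustering_coefficient(network, "w1"))
```

With these toy vectors, w1, w2, and w3 point in nearly the same direction and form a triangle, while w4 stays isolated. Average degree and the clustering coefficient are two commonly reported statistics for such networks; the abstract does not list which statistics the thesis actually analyzes.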

Keywords/Search Tags:

NLP, Tibetan, Word vectors, Similarity, Co-occurrence network

Related items

1	Construction And Application Of Mongolian Word Co-occurrence Network Based On Text Big Data
2	Research On Lexical Level Knowledge Mining Based On Corpus
3	The Development Research Of Chinese Interlanguage Verbs Of Japanese Learners Based On Word Co-occurrence Network
4	A Systematic Empirical Analysis Of Chinese Interlanguage Based On Word Co-occurrence Network
5	Neural Machine Translation Research Based On The Semantic Vector Of Tri-lingual Parallel Corpus
6	Text Analysis Of Speech Synthesis Based On Statistical Parameters Of Tibetan Language In Specific Fields
7	Research On Representation And Evaluation Of Tibetan Word Vector
8	Research On Sentiment Classification Technology Of Tibetan Text
9	Research On Word Segmentation And Part-of-speech Of Tibetan On Neural Network
10	Cambodian Named Entity Recognition Based On The Topic Model Word Vector