
A Similarity-based Tibetan Word Co-occurrence Network Construction Technology And Feature Analysis

Posted on: 2022-02-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Z Jia
GTID: 2515306482473434
Subject: Chinese Ethnic Language and Literature
Abstract/Summary:
Language and writing are products of the long-term evolution of human beings and the continuous development of civilization. Together they form a complex network system that has developed over a long period of time. Even with common linguistic methods, it is difficult to uncover the relationships and overall characteristics of a language's internal network. Language co-occurrence networks apply complex network techniques to explore the characteristics of human language and can further reveal the internal structural relationships of a language and its script. Scholars in many countries have done extensive work on co-occurrence networks for English and Chinese, and the results have been applied to a variety of language and text processing tasks. By comparison, techniques for constructing and analyzing Tibetan co-occurrence networks are still in their infancy. Research in this area can reveal the internal structure of the Tibetan language and script, and has wide application value in Tibetan information processing.

By analyzing the structure of the building modules of similarity-based co-occurrence networks, this thesis proposes a similarity-based method for constructing a Tibetan word co-occurrence network. The method uses words as network nodes and builds the network by connecting similar words with edges. We then analyze the network characteristics of the resulting Tibetan word co-occurrence network. The main contents include:

(1) Establishing an experimental corpus. There is currently no unified experimental corpus for constructing Tibetan word co-occurrence networks. We therefore collected an 18.07 M Tibetan text corpus containing 1,258,980 entries from Tibetan websites and electronic documents. After preprocessing, we obtained a high-quality experimental corpus for constructing the Tibetan word co-occurrence network.

(2) Tibetan word vector representation and similarity calculation. With the advance of deep learning, word embeddings can systematically represent the semantic relationships between words. Constructing a similarity-based co-occurrence network of Tibetan words requires representing Tibetan words as word vectors and calculating the similarity between words. Based on the constructed Tibetan sub-word corpus, the thesis trains word vectors with the CBOW model, which trains better on small corpora, and uses cosine similarity to calculate the similarity between words.

(3) Construction method and feature analysis of the Tibetan word co-occurrence network. Building on an analysis of co-occurrence network construction techniques for languages such as English and Chinese, we use our proposed method to construct the word co-occurrence network, experimentally verify the validity of the similarity-based construction method, and analyze the statistical characteristics of the Tibetan word co-occurrence network.
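
As an illustration of the idea described in the abstract (words as nodes, edges between similar words, then statistical analysis), the following is a minimal Python sketch and not the thesis's actual implementation. It assumes that an edge is added whenever the cosine similarity of two word vectors reaches a threshold. The threshold value (0.9), the two-dimensional toy vectors, the placeholder word labels `w1` to `w4`, and all function names are hypothetical. In practice the vectors would come from a trained CBOW model.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_network(vectors, threshold):
    """Undirected graph as an adjacency dict: word -> set of similar words.

    Assumption: two words are linked when their cosine similarity is at
    least `threshold`; the abstract does not state the exact edge rule.
    """
    graph = {word: set() for word in vectors}
    for w1, w2 in combinations(vectors, 2):
        if cosine_similarity(vectors[w1], vectors[w2]) >= threshold:
            graph[w1].add(w2)
            graph[w2].add(w1)
    return graph


def average_degree(graph):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2 * links / (k * (k - 1))


# Hypothetical toy vectors standing in for trained CBOW embeddings.
toy_vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.8, 0.2],
    "w4": [0.0, 1.0],
}

network = build_similarity_network(toy_vectors, threshold=0.9)
for word in sorted(network):
    print(word, sorted(network[word]), clustering_coefficient(network, word))
print("average degree:", average_degree(network))
```

With these toy vectors, `w1`, `w2`, and `w3` point in nearly the same direction and form a triangle, while `w4` stays isolated. The pairwise loop is quadratic in vocabulary size, so a real vocabulary would likely need a nearest-neighbour search instead of comparing every pair.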
Keywords/Search Tags: NLP, Tibetan, Word vectors, Similarity, Co-occurrence network