The goal of natural language processing is to give computers the same language comprehension ability as humans. Vector representation techniques turn text into vectors, allowing computers to automatically mine patterns in large-scale unlabelled text and extract important semantic information, so that machines can approach human-level language understanding. With the development of deep learning, Tibetan word and document vector representation has produced many results, but compared with English and Chinese it still lacks systematic experimental comparison, and there is no publicly available Tibetan word vector corpus or evaluation set, so in-depth research on Tibetan word and document vector representation is still needed.

This thesis first dissects the structure of existing word vector models, explains the connections between them, classifies and organizes both static and dynamic word vector models, and summarizes Tibetan word vector evaluation methods. It then describes the deep learning models required for Tibetan text classification and conducts the corresponding experiments, while also investigating Tibetan word vectors and document representation. The specific research work is as follows:

1. The Tibetan data required for word vector and document representation were collected and pre-processed to construct a dataset containing 104,367 Tibetan texts.

2. Four word vector training models, from the static Word2vec to the dynamic ELMo, BERT and ALBERT, are trained on the dataset constructed in this thesis (a training sketch is given below). To verify the effect of data volume on the Tibetan word vector models, the dataset is divided into four data volume ratios and each of the four models is trained on each subset. To verify the effect of corpus domain, further experiments show that, because Tibetan corpora are scarce, training word vector models for a specific task achieves better results.

3. A vector representation of Tibetan documents based on the BiGRU-TextCNN model is proposed, which can generate Tibetan document vectors flexibly and relatively quickly and can be applied to text classification (an illustrative model sketch is given below). The model is also compared with the Tibetan word vector models on text classification; the results show that there is still much room for improvement in Tibetan document vector representation, and the model's ability to represent Tibetan documents needs to be further improved.

4. Four Tibetan word vector models are obtained through the above experiments, among which the Tibetan ALBERT model performs best. To enable researchers to obtain Tibetan word vector representations, a Tibetan word vector generation system is designed and implemented.
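Point 2 includes training a static Word2vec model, but the abstract does not give the training setup. The following is a minimal sketch, assuming a pre-segmented Tibetan corpus (one whitespace-tokenized text per line) and the gensim 4.x API; the file path and all hyper-parameters (vector_size, window, min_count, sg, epochs) are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: training a static Word2vec model on a pre-segmented
# Tibetan corpus with gensim 4.x. Paths and hyper-parameters are
# illustrative assumptions, not the configuration used in the thesis.
from gensim.models import Word2Vec

def load_segmented_corpus(path):
    """Yield each Tibetan text as a list of tokens (one segmented text per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                yield tokens

# Hypothetical corpus file produced by the pre-processing step.
sentences = list(load_segmented_corpus("tibetan_corpus_segmented.txt"))

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,
    min_count=5,
    sg=1,              # skip-gram variant
    workers=4,
    epochs=10,
)
model.save("tibetan_word2vec.model")

# Look up the vector of a Tibetan word that occurs at least min_count times.
vec = model.wv["བོད"]   # "Tibet"
```

The same corpus loader could feed the dynamic models (ELMo, BERT, ALBERT) through their own pre-training pipelines, which are not shown here.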
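Point 3 proposes a BiGRU-TextCNN document representation whose exact architecture is not specified in this abstract. The PyTorch sketch below shows one plausible reading: token embeddings are encoded by a bidirectional GRU, the GRU hidden states pass through parallel TextCNN-style convolutions with max-over-time pooling, and the concatenated result serves as the fixed-length document vector fed to a classifier. The class name, layer sizes, kernel sizes, and class count are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUTextCNN(nn.Module):
    """Sketch of a BiGRU-TextCNN document encoder for text classification.

    Token embeddings -> bidirectional GRU -> parallel 1-D convolutions over
    the GRU states -> max-over-time pooling -> document vector -> classifier.
    Hyper-parameters are illustrative, not those used in the thesis.
    """

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128,
                 num_filters=100, kernel_sizes=(2, 3, 4), num_classes=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # One convolution per kernel size over the 2*hidden_dim GRU outputs.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden_dim, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def document_vector(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded Tibetan tokens.
        emb = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        states, _ = self.bigru(emb)            # (batch, seq_len, 2*hidden_dim)
        states = states.transpose(1, 2)        # (batch, 2*hidden_dim, seq_len)
        pooled = []
        for conv in self.convs:
            feature_map = F.relu(conv(states))             # (batch, filters, L)
            pooled.append(feature_map.max(dim=2).values)   # max over time
        return torch.cat(pooled, dim=1)        # fixed-length document vector

    def forward(self, token_ids):
        return self.classifier(self.document_vector(token_ids))
```

Calling `document_vector` alone would yield the document embedding, which is the part that would be compared against the word-vector-based classifiers mentioned in point 3, while `forward` adds the classification head for the text classification experiments.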