The goal of natural language processing is to give computers the same language comprehension ability as humans. Vector representation techniques turn text into vectors, allowing computers to automatically mine patterns in large-scale unlabelled text and extract important semantic information, so that machines can approach human-level language understanding. With the development of deep learning, Tibetan word and document vector representation has produced many results, but compared with English and Chinese it still lacks systematic experimental comparison, and there is no publicly available Tibetan word vector corpus or evaluation set, so in-depth research on Tibetan word and document vector representation is still needed.

This thesis first dissects the structure of existing word vector models, explains the connections between them, classifies and organizes both static and dynamic word vector models, and summarizes Tibetan word vector evaluation methods. It then describes the deep learning models required for Tibetan text classification and conducts the corresponding experiments, while also investigating Tibetan word vectors and document representation. The specific research work is as follows:

1. The Tibetan data required for word vector and document representation were collected and pre-processed to construct a dataset containing 104,367 Tibetan texts.

2. Four word vector training models, from the static Word2vec to the dynamic ELMo, BERT and ALBERT, are trained on the dataset constructed in this thesis (a training sketch is given below). To verify the effect of data volume on the Tibetan word vector models, the dataset is divided into four data volume ratios and each of the four models is trained on each subset. To verify the effect of corpus domain, further experiments show that, because Tibetan corpora are scarce, training word vector models for a specific task achieves better results.

3. A vector representation of Tibetan documents based on the BiGRU-TextCNN model is proposed, which can generate Tibetan document vectors flexibly and relatively quickly and can be applied to text classification (an illustrative model sketch is given below). The model is also compared with the Tibetan word vector models on text classification; the results show that there is still much room for improvement in Tibetan document vector representation, and the model's ability to represent Tibetan documents needs to be further improved.

4. Four Tibetan word vector models are obtained through the above experiments, among which the Tibetan ALBERT model performs best. To enable researchers to obtain Tibetan word vector representations, a Tibetan word vector generation system is designed and implemented.
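Point 2 includes training a static Word2vec model, but the abstract does not give the training setup. The following is a minimal sketch, assuming a pre-segmented Tibetan corpus (one whitespace-tokenized text per line) and the gensim 4.x API; the file path and all hyper-parameters (vector_size, window, min_count, sg, epochs) are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: training a static Word2vec model on a pre-segmented
# Tibetan corpus with gensim 4.x. Paths and hyper-parameters are
# illustrative assumptions, not the configuration used in the thesis.
from gensim.models import Word2Vec

def load_segmented_corpus(path):
    """Yield each Tibetan text as a list of tokens (one segmented text per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.strip().split()
            if tokens:
                yield tokens

# Hypothetical corpus file produced by the pre-processing step.
sentences = list(load_segmented_corpus("tibetan_corpus_segmented.txt"))

model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,
    min_count=5,
    sg=1,              # skip-gram variant
    workers=4,
    epochs=10,
)
model.save("tibetan_word2vec.model")

# Look up the vector of a Tibetan word that occurs at least min_count times.
vec = model.wv["བོད"]   # "Tibet"
```

The same corpus loader could feed the dynamic models (ELMo, BERT, ALBERT) through their own pre-training pipelines, which are not shown here.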
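Point 3 proposes a BiGRU-TextCNN document representation whose exact architecture is not specified in this abstract. The PyTorch sketch below shows one plausible reading: token embeddings are encoded by a bidirectional GRU, the GRU hidden states pass through parallel TextCNN-style convolutions with max-over-time pooling, and the concatenated result serves as the fixed-length document vector fed to a classifier. The class name, layer sizes, kernel sizes, and class count are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGRUTextCNN(nn.Module):
    """Sketch of a BiGRU-TextCNN document encoder for text classification.

    Token embeddings -> bidirectional GRU -> parallel 1-D convolutions over
    the GRU states -> max-over-time pooling -> document vector -> classifier.
    Hyper-parameters are illustrative, not those used in the thesis.
    """

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128,
                 num_filters=100, kernel_sizes=(2, 3, 4), num_classes=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # One convolution per kernel size over the 2*hidden_dim GRU outputs.
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden_dim, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def document_vector(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded Tibetan tokens.
        emb = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        states, _ = self.bigru(emb)            # (batch, seq_len, 2*hidden_dim)
        states = states.transpose(1, 2)        # (batch, 2*hidden_dim, seq_len)
        pooled = []
        for conv in self.convs:
            feature_map = F.relu(conv(states))             # (batch, filters, L)
            pooled.append(feature_map.max(dim=2).values)   # max over time
        return torch.cat(pooled, dim=1)        # fixed-length document vector

    def forward(self, token_ids):
        return self.classifier(self.document_vector(token_ids))
```

Calling `document_vector` alone would yield the document embedding, which is the part that would be compared against the word-vector-based classifiers mentioned in point 3, while `forward` adds the classification head for the text classification experiments.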