| Tibetan is one of the minority languages of China. In the information age, studying Tibetan natural language processing is necessary in order to better understand minority languages and, through them, minority cultures, and to promote the development of language-based artificial intelligence. Word vector representation is the basis of many natural language processing tasks. A good word vector representation helps computers understand text better, thereby improving performance on natural language processing tasks. Today, research on Chinese and English word vector representation is relatively mature and many open-source datasets are available, whereas research on Tibetan word vector representation is still rare and few open-source datasets exist. Therefore, to enable computers to understand Tibetan and to apply artificial intelligence to minority languages, this thesis studies Tibetan word vector representation and its evaluation. Firstly, according to the characteristics of words and components in Tibetan, this thesis proposes three models: the TCCWE-P model, which integrates relative position information; the further pretrained TCCWE-R model; and the TCCWE-PR model, which combines the TCCWE-P and TCCWE-R models. All three build on TCCWE, a multi-primitive joint training model for Tibetan word vector representation that is itself based on the Word2Vec model, and each improves on this base model to a different degree. Secondly, this thesis further proposes TCCWE, TCCWE-R, TCCWE-P and TCCWE-PR models based on the Doc2Vec model, which achieve better results. The results show that the Doc2Vec model is more suitable as a base model for training Tibetan word vector representations. Thirdly, to compare the semantic expression ability of the word vector representation models, each model is internally evaluated on an evaluation set. Since there is no open-source Tibetan evaluation set, this thesis also proposes a 
plan to manually construct a Tibetan word similarity evaluation set and uses the constructed set to internally evaluate the trained Tibetan word vector representations. The evaluation results show that the TCCWE-P-item model based on the CBOW model has better semantic expression ability than the CBOW model itself and the other CBOW-based models, and that the TCCWE model based on the Doc2Vec model has the best semantic expression ability. Finally, to compare the word vector representations more comprehensively, the trained models are externally evaluated through specific downstream tasks. This thesis compares text classification performance before and after mixing the corpus used to train the word vector representations with the text classification corpus, and under different features. The results show that the combination of the mixed corpus, word vector representations as text features, and the TextRCNN model performs best on the text classification task. The best-performing TextRCNN model is therefore selected for external evaluation of the Tibetan word vector models under the different improvements. The results show that the TCCWE model trained on the PV-DM model in Doc2Vec achieves the best accuracy on the text classification task, and that the TCCWE-PR-char model shows the largest improvement in accuracy. |
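
The internal evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the similarity evaluation set is a list of word pairs with human-assigned scores, and that a model is scored by the Spearman rank correlation between its cosine similarities and those human scores. This is a common convention for word similarity benchmarks; the abstract does not state the exact metric used in the thesis. The vectors, the placeholder words `w1` to `w4`, and the scores below are invented toy data, not results from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, eval_set):
    """Correlate model similarities with human scores, skipping unknown words."""
    model_scores, human_scores = [], []
    for word_a, word_b, human_score in eval_set:
        if word_a in vectors and word_b in vectors:
            model_scores.append(cosine(vectors[word_a], vectors[word_b]))
            human_scores.append(human_score)
    return spearman(model_scores, human_scores)


# Hypothetical toy data: placeholder words and made-up human scores.
toy_vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
toy_eval_set = [("w1", "w2", 9.0), ("w1", "w4", 5.0), ("w1", "w3", 1.0)]

rho = evaluate(toy_vectors, toy_eval_set)
print(f"Spearman correlation on the toy set: {rho:.3f}")
```

In this toy set the model ranks the three pairs in the same order as the human scores, so the correlation is 1 up to floating-point rounding. With a real evaluation set, a higher correlation would indicate that a model's similarities agree more closely with human judgments, which is one way to compare the semantic expression ability of different word vector models.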