Research And Implementation Of Tibetan Text Classification Based On AdaBoost Model

Posted on: 2020-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Jia
GTID: 2415330599952147
Subject: Computer application technology
Abstract/Summary:
At present, a large number of Tibetan literature resources have been digitized. Classifying these texts automatically benefits literature management and lets readers query related documents quickly and conveniently. Because the structure of the Tibetan language is complex, it has been studied in natural language processing for a relatively short time, and no relatively mature text classification system exists yet. The main reason is that the corpora and models used in classification experiments are relatively few; even where some models have been studied experimentally, the classification results are not very satisfactory, which hinders the development of classification technology. Therefore, this thesis collects a corpus of a certain scale from the web and studies and implements Tibetan text classification with AdaBoost, a relatively mature machine learning classification model. The experimental results show that the model improves the processing of Tibetan texts and has good classification performance.

Building on a study of text classification research in China and abroad, and taking the characteristics of the Tibetan language into account, the thesis uses multi-category samples of differing sizes and several widely recognized feature types as the data source, with a relatively mature machine learning classification model at the core. A Tibetan text classification system based on the AdaBoost model is established, and testing shows that it achieves the expected results. The research results of this thesis are as follows.

1. Because the Tibetan corpora currently available for research and experiments are relatively small, more than 70,000 corpus items were collected personally for this thesis and divided into 7 categories. After text preprocessing, a total of 4392 normalized samples were formed, completing the construction of the sample set.

2. Using N-grams and words as the objects of feature extraction, and applying a feature frequency sorting algorithm, an information gain algorithm, an information gain adding algorithm and a forward stepwise regression algorithm, about 100 features with clear category discrimination are selected from tens of thousands of candidates. These are used as the features in the experiments, which improves the classification efficiency of the model.

3. As groundwork for constructing the strong classification model, research and experiments are carried out on four conventional classification models: KNN, GaussianNB, logistic regression and SVM. These models show stable classification performance.

4. After studying how the AdaBoost model classifies text, this thesis proposes using the four machine learning classification models listed in (3) to replace the weak classification models that the original AdaBoost algorithm obtains iteratively. Eleven AdaBoost classification models were generated. The results of 5-fold cross-validation (5-CV) experiments show that the models using one-symbol, two-symbol and word features reached classification accuracy and recall of more than 90%, and that the lowest-scoring models, those using three-symbol features, still reached 88% on both measures. The AdaBoost model that uses one-symbol features and is based on all four machine learning models reached an accuracy of 96% and a recall of 95%. This shows that the model has good classification performance.

5. Based on the modified AdaBoost classification algorithm, a relatively complete classification system is designed, and the classification performance of the model is demonstrated through an intuitive interface.

With the continuous development of natural language processing, text classification technology is becoming more and more mature, but research on Tibetan text classification is still in its infancy and relatively few experiments have been reported. This thesis builds on classification theory and obtains experimental data by exploring classification models. Its results therefore have some reference value for subsequent research.
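
The feature selection in result 2 and the boosting loop in result 4 can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: information gain is computed on the presence or absence of a character n-gram; the multi-class SAMME variant of AdaBoost is used, which the abstract does not specify; and simple one-feature "stump" learners stand in for the KNN, GaussianNB, logistic regression and SVM base models so that the example needs only the standard library. All function names are hypothetical, and the sample strings are ASCII placeholders rather than Tibetan text.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Set of character n-grams occurring in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """Entropy reduction from splitting documents on presence of feature."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    cond = sum(len(p) / len(labels) * entropy(p) for p in (present, absent) if p)
    return entropy(labels) - cond

def select_features(docs, labels, k):
    """Keep the k features with the highest information gain (ties broken by name)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))[:k]

def fit_stump(docs, labels, weights, feature):
    """Assumed weak learner: weighted majority label on each side of one feature."""
    votes = {True: Counter(), False: Counter()}
    for d, y, w in zip(docs, labels, weights):
        votes[feature in d][y] += w
    default = Counter(labels).most_common(1)[0][0]
    table = {side: (c.most_common(1)[0][0] if c else default)
             for side, c in votes.items()}
    return lambda d: table[feature in d]

def adaboost_samme(docs, labels, features, rounds):
    """SAMME boosting: each round keeps the candidate with the lowest weighted error."""
    n, n_classes = len(docs), len(set(labels))
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        for f in features:
            h = fit_stump(docs, labels, weights, f)
            err = sum(w for d, y, w in zip(docs, labels, weights) if h(d) != y)
            if best is None or err < best[0]:
                best = (err, h)
        err, h = best
        if err >= 1 - 1 / n_classes:  # no better than chance, so stop early
            break
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        ensemble.append((alpha, h))
        # Raise the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(alpha) if h(d) != y else w
                   for d, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, doc):
    """Weighted vote of all weak learners in the ensemble."""
    scores = Counter()
    for alpha, h in ensemble:
        scores[h(doc)] += alpha
    return scores.most_common(1)[0][0]

# Placeholder data: three classes, two short strings each.
texts = ["aaab", "abaa", "ccdc", "dccc", "eeff", "fefe"]
labels = ["x", "x", "y", "y", "z", "z"]
docs = [char_ngrams(t, 1) for t in texts]           # one-symbol features
selected = select_features(docs, labels, k=4)
model = adaboost_samme(docs, labels, selected, rounds=5)
print([predict(model, d) for d in docs])
```

To approximate the thesis's setup, the inner loop over stumps would instead iterate over a pool of different model types, each trained with the current sample weights, and `char_ngrams` would be called with `n` set to 1, 2 or 3 for the one-, two- and three-symbol feature sets.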
Keywords/Search Tags: Tibetan text classification, feature processing, AdaBoost classification model, classification system