| As a key technology in natural language processing, Chinese word segmentation largely determines the quality of downstream text processing. In the field of ideological and political education, domain vocabulary is characterized by the rapid emergence of new words, wide coverage, and a large vocabulary size, which makes word segmentation and subsequent work in this field difficult. To address this problem, this paper designs a word segmentation system for ideological and political education based on unsupervised learning. The system builds a corpus from domain literature, trains a word-level language model combined with a statistical language model, and uses the Viterbi algorithm to perform preliminary Chinese word segmentation. The segmentation result is then refined by a Chinese word segmentation optimization algorithm based on word frequency deviation. Based on the segmentation results, the system provides users with keyword extraction, word frequency statistics, word cloud drawing, and related functions, thereby supporting Chinese word segmentation and text analysis of domain literature. The main work and achievements are as follows: (1) Following the training process of the traditional word-level n-gram language model, this paper obtains a word-level language model. Based on this language model, the Viterbi algorithm is used to find the optimal segmentation path. Because professional terms tend to be long, a word segmentation optimization algorithm based on word frequency deviation is added to recombine the preliminary segmentation results and output the final segmentation. (2) In building the corpus, vocabulary coverage is taken into account, including professional vocabulary, trending vocabulary, and common vocabulary. Web crawlers and other means are used to collect domain literature for the corpus, so that the corpus covers the common vocabulary of the domain as fully as possible. (3) According to the needs of text processing, the TF-IDF algorithm is implemented in the system to provide keyword extraction. The word frequency statistics function counts word occurrences in the text, the word cloud reflects the relative importance of words in the text, and the research hotspot analysis function plots the number of documents over 20 years for different words to meet the needs of text analysis. (4) The overall scheme of the word segmentation system for ideological and political education is designed and implemented. The front end is mainly written in PyQt5, and the back end is mainly implemented in Python. The scheme includes the structural design of the system and the functional design of each module. The system targets the field of ideological and political education. Experimental results show that the segmentation system built on these ideas improves the accuracy of Chinese word segmentation and the efficiency of text analysis in this field, and helps promote the study of documents in the field. |
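
The preliminary segmentation step described in the abstract (a word-level language model searched with the Viterbi algorithm) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a unigram model rather than the n-gram model the abstract mentions, it omits the word frequency deviation optimization step, and the names `viterbi_segment`, `toy_counts`, `toy_logprob`, `max_word_len`, and `unk_logprob` as well as the counts themselves are hypothetical and chosen only for illustration.

```python
import math


def viterbi_segment(text, word_logprob, max_word_len=6, unk_logprob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    word_logprob: dict mapping known words to log-probabilities (assumed input).
    unk_logprob:  penalty for a single unknown character (illustrative value).
    """
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_logprob:
                score = word_logprob[word]
            elif len(word) == 1:
                # Fall back to single characters so every position is reachable.
                score = unk_logprob
            else:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Hypothetical toy counts, standing in for statistics gathered from a corpus.
toy_counts = {"思想": 50, "政治": 60, "教育": 80,
              "思想政治": 10, "思想政治教育": 30, "研究": 40}
total = sum(toy_counts.values())
toy_logprob = {w: math.log(c / total) for w, c in toy_counts.items()}

print(viterbi_segment("思想政治教育研究", toy_logprob))
# ['思想政治教育', '研究']
```

With these toy counts the long domain term is kept as one word because its single log-probability is higher than the sum of the log-probabilities of its parts. A real system would estimate the probabilities from the domain corpus and would likely condition on preceding words.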