| Intelligent classification of government documents has practical significance for improving the efficiency of government affairs, and it is one of the key areas in which intelligent government services need a breakthrough. This thesis applies deep learning-based natural language processing to government document classification for the first time. It aims to improve model accuracy and to provide an algorithmic solution that meets the application standards of a real intelligent government document processing system. Government document processing is a typical multi-label text classification task with its own corpus characteristics. This thesis optimizes and improves the baseline model in three respects: word embedding, text feature extraction, and mining the correlation between official document labels. It then proposes the final multi-label classification scheme for official documents: a multi-label prediction model that performs high-order modeling of label correlation. Based on the characteristics of the project dataset, this thesis carries out the following work: (1) Because the GDCD dataset has a small vocabulary and government documents contain many domain-specific terms, two methods are proposed for optimizing word embeddings in the government document domain. The first uses word embeddings pre-trained on a large public corpus to extend the word embeddings of the government document domain. Combining the word vectors of the two domains expands the vocabulary of the document domain and enriches the semantic features of the original word vectors while retaining the contextual semantics of the document domain. The second directly uses BERT-pretrained word vectors for the document domain to mine contextual semantic information in government documents in depth. (2) Government document titles are usually clearly structured: the titles of official documents contain highly condensed vocabulary that summarizes the body content. Title information can therefore be introduced as an auxiliary feature during context encoding, which yields a text representation with more prominent key information and improves prediction accuracy. This thesis uses co-attention between the title and the body to capture their relationship and obtain a title-aware contextual encoding. Combining the above improvements, and without considering the correlation among sample labels, the first-order classification model TA-LEM is proposed; it verifies the effectiveness of the word vector optimization and the feature extraction improvement on this task. (3) Because label correlation helps label prediction, this thesis builds two multi-label classification models that incorporate the above improvements and take higher-order relationships between labels into account: (a) the TA-SGM model uses label sequence generation to model the semantic associations between labels; (b) the GML-GCN model treats each classifier as a node in a graph and uses the flow of information through the graph structure to model the inherent correlation between labels. Experimental results show that both models outperform traditional multi-label classification models on government document processing and that both meet the project acceptance requirements. |
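
The first word-embedding optimization in item (1), combining domain word vectors with vectors pre-trained on a public corpus, can be illustrated with a minimal sketch. The abstract does not state how the two sets of vectors are combined, so the concatenation with zero-padding used here is an assumption, and `combine_embeddings`, its parameters, and the toy vectors are hypothetical names and data chosen for illustration only.

```python
def combine_embeddings(domain_vecs, public_vecs, domain_dim, public_dim):
    """Merge two word-vector tables by per-word concatenation.

    Assumption (not stated in the abstract): vectors are concatenated,
    and a word missing from one table gets a zero vector on that side.
    """
    combined = {}
    # The merged vocabulary is the union of both vocabularies,
    # which is how the document-domain vocabulary gets expanded.
    for word in set(domain_vecs) | set(public_vecs):
        domain_part = domain_vecs.get(word, [0.0] * domain_dim)
        public_part = public_vecs.get(word, [0.0] * public_dim)
        # The domain part keeps in-domain semantics; the public part
        # adds general-purpose semantic features.
        combined[word] = list(domain_part) + list(public_part)
    return combined


# Toy, hypothetical data: 2-dimensional domain vectors, 3-dimensional public vectors.
domain = {"notice": [1.0, 2.0], "decree": [3.0, 4.0]}
public = {"notice": [0.5, 0.5, 0.5], "city": [9.0, 8.0, 7.0]}

merged = combine_embeddings(domain, public, domain_dim=2, public_dim=3)
print(len(merged))        # size of the expanded vocabulary
print(merged["notice"])   # word present in both tables
print(merged["city"])     # word present only in the public table
```

Concatenation leaves the original domain vector untouched inside the merged vector, which is one simple way to retain in-domain semantics while adding information from the public corpus. Other combinations, such as a weighted average or a learned projection, would fit the description equally well.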