| Learning the structure of text is an important research topic in natural language processing. An effective feature representation not only saves human effort but also greatly improves performance on downstream tasks. However, building Korean-language corpora is time-consuming and laborious, and research on Korean information processing and intelligent processing remains scarce, so work on Korean text feature learning has considerable academic significance and research value. Combining deep reinforcement learning with the self-attention mechanism, this thesis proposes two Korean sentence feature learning models. Approaching the problem from the perspective of representation learning and drawing on the distinctive characteristics of Korean corpora, the models construct feature representations that improve the performance of downstream tasks. First, more than 30,000 abstracts of Korean scientific and technical documents carrying 13 different labels are collected, and a raw text data set is constructed with the sentence as its unit. The data set contains many technical terms, and the labels of its samples are hard to distinguish. After the raw data are cleaned, a segmentation-granularity experiment and a word-vector training experiment are carried out in turn. The segmentation-granularity experiment makes the initial information more intuitive and avoids the word segmentation errors typical of an agglutinative language. The word-vector training experiment yields suitable text vector representations that combine different semantic contexts. After screening and analysis, the processed data serve as the model input. Second, two structured representation models are proposed: Information Distilled Attention (ID-Attention) and Hierarchically Structured Attention (HS-Attention). The ID-Attention model learns whether to retain or delete each word of a sentence, so that the policy network can be fully trained and the errors introduced by word segmentation tools or stop-word removal are avoided. The HS-Attention model learns and adjusts the internal structure of a Korean sentence and thereby obtains a text vector representation that carries structural information. Both models use reinforcement learning to update the structured representation of sentences, and both achieve higher classification accuracy in text classification experiments. Finally, each model feeds the classification accuracy obtained by its own classification network back to the policy network for joint training, so that a better sequence of actions is learned. As a result, the ID-Attention model can identify the important words in a Korean sentence and remove stop words, while the HS-Attention model can divide a sentence into hierarchical levels and make its features more salient. This thesis recasts the task of structuring text as a sequential decision task, so that the models can identify words and partition structure in Korean data without explicit structure annotation, which greatly reduces the dependence of Korean structure learning on manual tagging. On the indirect evaluation metric, the classification accuracy of the proposed models is higher than that of classification models based on statistics, sequence structure, attention, and reinforcement learning: it is 1.46% higher than that of the reinforcement-learning-based HS-LSTM model and 2.2% higher than that of the attention-based model (Self-attention). On the direct evaluation metric, the Korean sentence feature representations obtained by the models receive good expert scores. The experimental results show that the two models proposed in this thesis can identify the important text features of Korean, that these features are close to manual annotation, and that the models therefore provide useful support for Korean informatization and intelligent processing. |