| Pathology diagnosis reports are a valuable data resource for intelligent healthcare. They are typically written as unstructured natural-language text, and attribute extraction techniques can convert their content into structured form. Previous work on attribute extraction from medical diagnostic texts considered only a single granularity, assuming no hierarchical inclusion relationship between attributes. This study considers three granularities, "specimen", "lesion", and "attribute", treats "specimen" and "lesion" as special attributes, and performs multi-granularity attribute extraction. In addition, because medical texts are highly specialized, annotation is costly and high-quality labeled data is scarce. This study therefore adopts semi-supervised learning to address the small-sample problem in diagnostic text. The main contributions are as follows: (1) A multi-granularity attribute extraction model based on multi-task learning, MGAE-MT-NUM-REG, is proposed to extract hierarchical multi-granularity attribute values from pathology diagnosis text. First, building on the ERNIE-BiLSTM-CRF base model and the multi-task learning paradigm, the model shares parameters in the ERNIE and BiLSTM layers. Then, a separate CRF layer for each granularity is attached in the decoding layer to perform multi-granularity attribute extraction. Second, based on the number of attribute entities at each granularity and the hierarchical relationships between granularities, auxiliary tasks that predict the number of "specimen" and "lesion" entities are added, and a regularization penalty loss is introduced to further improve overall performance. The experimental results show that, compared with the single-granularity single-task model, the performance of
the multi-task model is improved, with F1 values higher by 3.87%, 6.54%, and 2.00% at the specimen, lesion, and attribute granularities, respectively. (2) A semi-supervised multi-granularity attribute extraction model, Semi-MT-MGAE, is proposed to address the small-sample problem. The model is trained on both labeled and unlabeled data and consists of a supervised module and an unsupervised module. The supervised module takes labeled data as input and processes it in the same way as the MGAE-MT-NUM-REG model. The unsupervised module takes unlabeled data as input: after encoding by the ERNIE layer, the word embedding vectors are weakly and strongly augmented, and pseudo-labels are generated from the probability distributions predicted for the original and weakly augmented vectors. The generated pseudo-labels and the strongly augmented vectors are then used to compute the loss for model training. The experiments show that, compared with training on only 100 labeled texts, training on 100 labeled and 500 unlabeled texts improves F1 values by 9.06%, 7.94%, and 10.51% at the specimen, lesion, and attribute granularities, respectively. (3) Finally, based on the proposed models and the requirements of practical application scenarios, a prototype system for multi-granularity attribute extraction from lung cancer diagnosis texts is designed. After the user enters a lung cancer pathology diagnosis text, the system calls the back-end algorithm to extract multi-granularity attributes and displays the results on the system page. The experimental results show that the MGAE-MT-NUM-REG model achieves multi-granularity attribute extraction and that the Semi-MT-MGAE model improves small-sample multi-granularity attribute extraction, meeting the expected design requirements. |
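
The unsupervised step described in (2) can be illustrated with a minimal sketch. It is not the Semi-MT-MGAE implementation: it assumes per-token softmax classification rather than CRF decoding, averages the predictions for the original and weakly augmented views to form the pseudo-label, and keeps only tokens above a confidence threshold. The averaging rule, the `threshold` value, and the function names `pseudo_labels` and `unsupervised_loss` are all assumptions made for illustration; the logits stand in for model outputs on each view.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's label logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(orig_logits, weak_logits, threshold=0.9):
    """Build one (label, keep) pair per token.

    Assumption: the pseudo-label is the argmax of the averaged
    distributions of the original and weakly augmented views, and a
    token is kept only if that averaged probability reaches `threshold`.
    """
    result = []
    for lo, lw in zip(orig_logits, weak_logits):
        po, pw = softmax(lo), softmax(lw)
        avg = [(a + b) / 2 for a, b in zip(po, pw)]
        label = max(range(len(avg)), key=avg.__getitem__)
        result.append((label, avg[label] >= threshold))
    return result


def unsupervised_loss(strong_logits, labels):
    """Mean cross-entropy of the strong-view predictions against the
    pseudo-labels, computed over confident tokens only."""
    losses = [
        -math.log(softmax(ls)[y])
        for ls, (y, keep) in zip(strong_logits, labels)
        if keep
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy logits for two tokens and three labels (made-up values).
orig = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
weak = [[3.5, 0.1, 0.0], [0.0, 0.3, 0.1]]
strong = [[2.0, 0.5, 0.0], [0.1, 0.0, 0.2]]

labels = pseudo_labels(orig, weak)
loss = unsupervised_loss(strong, labels)
print(labels, round(loss, 4))
```

In this toy input the first token is predicted confidently in both views and contributes to the loss, while the second token is close to uniform and is masked out. In a full model this term would be added to the supervised loss with a weighting factor, which is likewise an assumption here.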