| Machine reading comprehension (MRC) aims to teach machines to answer a question correctly after understanding a given passage of text, and it is both a foundation and a long-term goal of natural language understanding. Several forms of MRC task already exist, such as extractive and inferential reading comprehension, and researchers usually focus on one type of task, but real-life applications often require models that can handle many types of tasks simultaneously. In addition, natural language processing models are often trained on large amounts of labeled data with supervised learning, in the expectation that the model will learn more latent knowledge. However, in practical application scenarios such as the legal, financial, and medical fields, labeled data is severely lacking, and labeling a large number of samples is relatively expensive. How to effectively handle multi-task reading comprehension data and unlabeled data is therefore an important part of this research. To address multi-task reading comprehension, existing methods handle the different reading comprehension tasks separately by introducing additional auxiliary loss functions. However, multi-task learning models based on auxiliary losses often weight the losses by simple averaging, which does not balance the tasks during model training. For unlabeled data, self-training methods can effectively use both labeled and unlabeled data to improve the performance of deep learning models. In natural language processing, self-training is widely used in text classification and sequence labeling tasks; however, most such methods select pseudo-labeled samples by predicting the probability distribution of target labels from sentence embeddings, which is not suitable for span extraction tasks, where the model must predict the answer span of a question at the word level. The contributions of this paper are as follows. We propose a self-training method for reading comprehension span extraction that consists of two parts: a multi-task fusion training reading comprehension model and a word-level pseudo-label selector. The multi-task fusion training reading comprehension model unifies the outputs of the different task modules as the output of the span extraction task, which addresses the inability of multi-task learning models based on auxiliary loss functions to balance multiple tasks during training. The word-level pseudo-label selector uses the confidence of the start and end positions in the model's predicted output to obtain valuable pseudo-labeled data, which makes self-training applicable to the reading comprehension span extraction task and addresses the problem of obtaining pseudo-labels at the word level in text self-training. We conducted experiments on SQuAD 2.0, CAIL2019, and medical advice text datasets, and the results show that the proposed self-training method for machine reading comprehension span extraction improves the performance of machine reading comprehension models in the legal and medical fields by 1-2%. |
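
The word-level pseudo-label selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `threshold` and `max_len` parameters, and the rule that both boundary probabilities must clear the threshold are all assumptions made for illustration, and unanswerable questions (as in SQuAD 2.0) are not handled.

```python
import math
from typing import List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over token-level logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(
    start_probs: List[float], end_probs: List[float], max_len: int
) -> Tuple[int, int]:
    """Return the (start, end) pair with start <= end that maximizes
    start_probs[start] * end_probs[end], limited to max_len tokens."""
    best = (0, 0)
    best_score = start_probs[0] * end_probs[0]
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


def select_pseudo_label(
    start_logits: List[float],
    end_logits: List[float],
    threshold: float = 0.9,  # hypothetical confidence cutoff
    max_len: int = 30,       # hypothetical maximum answer length in tokens
) -> Optional[Tuple[int, int]]:
    """Return a pseudo-labeled span for an unlabeled example, or None
    when the model is not confident enough about either boundary."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start, end = best_span(start_probs, end_probs, max_len)
    # Assumed selection rule: both word-level boundary confidences
    # must reach the threshold for the example to be kept.
    if min(start_probs[start], end_probs[end]) >= threshold:
        return (start, end)
    return None
```

With these assumed defaults, sharply peaked logits such as `select_pseudo_label([0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0])` yield the span `(1, 2)`, while uniform logits yield `None`, so that example would be left out of the next self-training round.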