In recent years, with the continuous development of artificial intelligence, natural language processing has attracted a growing number of researchers. Machine reading comprehension, one of the core tasks of natural language processing, has likewise received wide attention, and extractive reading comprehension is an important branch of it. Many large language models achieve strong results on extractive reading comprehension, but their performance degrades significantly when training samples are scarce. Data augmentation methods such as back-translation can enlarge the dataset, but the newly generated samples may no longer contain the correct answer to the question, lowering their quality. To address these problems, this paper carries out research on two fronts, the dataset and the model:

1. To improve the poor performance of extractive reading comprehension models when samples are scarce, simple data augmentation is applied to the existing small dataset. At the word level, while the correct answer in the text is preserved, the data are expanded by randomly selecting words for synonym replacement, random insertion, and random deletion; the new vocabulary introduced this way improves the model's generalization. At the sentence level, two sentences in the text are randomly swapped, or a sentence randomly drawn from an adjacent sample replaces a sentence in the current one, which introduces a limited amount of noise without disturbing the overall structure of the text (a sketch follows the abstract).

2. To address the loss of semantic information and the noise caused by simple data augmentation, DASS (Data Augmentation based on Semantic Similarity) is proposed. For each word, the semantic similarity between the sentence before and after deleting that word is computed to measure the word's impact on sentence semantics, and the word with the least impact is selected for the augmentation operation, preserving the semantic consistency of the newly generated sample and the integrity of its semantic information (see the second sketch below).

3. To reduce the large amount of noise introduced into the text by data augmentation, BLADA (Block Attention for Data Augmentation) is proposed on top of the Transformer-based SpanBERT model. A context prediction layer is added before the original answer prediction layer: the span of text most relevant to the answer is identified first, and answer prediction is then carried out on it. The added layer reduces the impact of augmentation noise on answer prediction (see the third sketch below).

Experiments are conducted on the HotpotQA dataset. The results show that the proposed data augmentation methods and model improvements effectively reduce the semantic loss caused by augmentation and improve the performance of the language model on extractive reading comprehension, substantially alleviating the poor performance of language models when samples are scarce.
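The answer-preserving word-level augmentation of point 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy synonym table, the edit probability p, and the helper names are assumptions (the abstract does not specify a synonym source, for which WordNet is one common choice), and random insertion and the sentence-level operations would follow the same answer-preserving pattern.

```python
# Minimal sketch of answer-preserving word-level augmentation (point 1).
# SYNONYMS is a toy stand-in for a real lexicon such as WordNet.
import random

SYNONYMS = {"quick": ["fast", "rapid"], "big": ["large", "huge"]}

def augment_words(tokens, answer_start, answer_end, p=0.1, seed=0):
    """Apply synonym replacement / random deletion to tokens OUTSIDE the
    answer span, so every augmented sample still contains the answer."""
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        if answer_start <= i < answer_end:
            out.append(tok)                  # never edit the answer span
            continue
        r = rng.random()
        if r < p and tok in SYNONYMS:        # synonym replacement
            out.append(rng.choice(SYNONYMS[tok]))
        elif p <= r < 2 * p:                 # random deletion
            continue
        else:
            out.append(tok)
    # NOTE: a real pipeline would recompute the answer's token offsets,
    # since deletions before the answer shift its position.
    return out

tokens = "the quick brown fox jumps over the big lazy dog".split()
print(augment_words(tokens, answer_start=2, answer_end=4))  # answer: "brown fox"
```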
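The word-selection criterion of DASS (point 2) can be illustrated as below: delete each word in turn, embed the reduced sentence, and keep the word whose removal leaves the sentence most similar to the original. A minimal sketch, assuming sentence embeddings compared by cosine similarity; the abstract does not name a similarity model, so the sentence-transformers model used here ("all-MiniLM-L6-v2") is an illustrative stand-in.

```python
# Sketch of the DASS word-selection step (point 2): the least-impactful
# word is the one whose deletion keeps similarity to the original highest.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def least_impactful_word(tokens):
    """Return (index, similarity) of the word whose deletion changes
    sentence semantics the least."""
    original = model.encode(" ".join(tokens))
    best_i, best_sim = None, -1.0
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        sim = cosine(original, model.encode(" ".join(reduced)))
        if sim > best_sim:
            best_i, best_sim = i, sim
    return best_i, best_sim

tokens = "the festival was held in the old town square".split()
idx, sim = least_impactful_word(tokens)
print(tokens[idx], sim)  # the word DASS would target for augmentation
```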
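Point 3 gives only the high-level design of BLADA, so the following is a speculative sketch rather than the paper's architecture: it scores fixed-size blocks of SpanBERT's token representations with an added context prediction layer, then masks the answer start/end logits outside the highest-scoring block. The block size, mean pooling, and hard argmax selection are all assumptions introduced here.

```python
# Rough sketch of the BLADA idea (point 3): a context prediction layer
# scores blocks of the passage, and answer prediction is restricted to
# the most answer-relevant block.
import torch
import torch.nn as nn

class BlockThenAnswerHead(nn.Module):
    def __init__(self, hidden=768, block=32):
        super().__init__()
        self.block = block
        self.block_scorer = nn.Linear(hidden, 1)  # context prediction layer
        self.answer_head = nn.Linear(hidden, 2)   # start/end logits

    def forward(self, h):  # h: (batch, seq, hidden), e.g. SpanBERT output
        b, s, d = h.shape
        # Score each block by mean-pooling its token representations.
        blocks = h.reshape(b, s // self.block, self.block, d).mean(dim=2)
        block_scores = self.block_scorer(blocks).squeeze(-1)   # (batch, n_blocks)
        best = block_scores.argmax(dim=-1)                     # most relevant block
        # Mask tokens outside the chosen block before answer prediction.
        token_block = torch.arange(s, device=h.device) // self.block
        mask = token_block.unsqueeze(0) == best.unsqueeze(1)   # (batch, seq)
        logits = self.answer_head(h)                           # (batch, seq, 2)
        logits = logits.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return logits[..., 0], logits[..., 1]                  # start, end logits

head = BlockThenAnswerHead()
start, end = head(torch.randn(2, 128, 768))  # seq length divisible by block
print(start.shape, end.shape)                # torch.Size([2, 128]) each
```

Note that the hard argmax would block gradients to the block scorer during training; a soft weighting of blocks or an auxiliary loss on the block scores would be one way around that, though the abstract does not say how BLADA is trained.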