An Open-Domain QA System Based On Heterogeneous Dense Representations

Posted on: 2023-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Zhou
Full Text: PDF
GTID: 2558306914477114
Subject: Information and Communication Engineering
Abstract/Summary:
This thesis studies the design of an open-domain question answering system based on heterogeneous dense vector representations. An open-domain question answering system usually consists of three basic modules: retrieval, reranking, and reading comprehension. This thesis addresses the design and training of the retrieval module, and the training and lightweighting of the reranking module.

The retrieval module selects, from a large-scale document collection, the documents most likely to help answer the user’s question. Rule-based retrieval methods focus only on the overlap between texts, while neural retrieval methods usually consider only semantic matching or contextual relevance. To retrieve text more accurately, this thesis proposes integrating three text relevance features, namely text overlap, semantic matching, and contextual relevance (or topic consistency), to quantify the relevance between user questions and documents. On this basis, a dedicated extraction method is designed for these three relevance features without changing the encoder architecture, so that the three heterogeneous features can be expressed as dense vectors within the neural network model and then fused into a single representation.

The reranking module further identifies, among the candidate documents returned by the retrieval module, the supporting documents needed to answer the question. Although both the reranking and retrieval modules essentially score documents through feature extraction, they address two completely different aspects at the linguistic level. The retrieval module aims to retrieve as many question-relevant documents as possible, without considering whether those documents are sufficient to support an answer to the question. To keep the training process consistent with the inference process, this thesis proposes constructing additional negative samples from the candidate documents that the retrieval module outputs on the existing supervised dataset, so that the reranking model’s training data is sufficient to teach the model to judge whether a document can answer the question.

Finally, this thesis attempts to distill the trained reranking model so that it fits consumer-grade devices. Distilling the fine-tuned BERT into a TextCNN effectively increases the number of candidate documents the reranking model can process, which significantly speeds up inference and reduces memory usage without losing too much performance.

The proposed document retrieval and document reranking methods achieve significant overall performance improvements on multiple mainstream question answering datasets, and the lightweight reranking method significantly reduces the model’s computational resource consumption.
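The negative-sample construction described for the reranking module can be illustrated with a minimal sketch. It assumes that each supervised question comes with a set of annotated supporting document IDs and a ranked list of IDs returned by the retriever; the function name `build_reranker_examples`, the `max_negatives` cutoff, and the document IDs are hypothetical and are not taken from the thesis, which does not state how many negatives are kept or how they are selected.

```python
from typing import Dict, List, Sequence, Set


def build_reranker_examples(
    question: str,
    retrieved_ids: Sequence[str],
    gold_ids: Set[str],
    max_negatives: int = 4,  # hypothetical cutoff, not specified in the thesis
) -> List[Dict[str, object]]:
    """Label retriever output for reranker training.

    Positives are the annotated supporting documents. Negatives are
    documents the retriever returned (kept in rank order) that are not
    annotated as supporting, so training sees the same kind of
    candidates the reranker will face at inference time.
    """
    # Annotated supporting documents become positive examples.
    examples: List[Dict[str, object]] = [
        {"question": question, "doc_id": doc_id, "label": 1}
        for doc_id in sorted(gold_ids)
    ]
    # Retrieved but unannotated documents become negative examples.
    negatives = [d for d in retrieved_ids if d not in gold_ids][:max_negatives]
    examples += [
        {"question": question, "doc_id": doc_id, "label": 0}
        for doc_id in negatives
    ]
    return examples


if __name__ == "__main__":
    demo = build_reranker_examples(
        "who wrote hamlet",
        retrieved_ids=["d3", "d1", "d7", "d9"],
        gold_ids={"d1", "d2"},
        max_negatives=2,
    )
    for example in demo:
        print(example["doc_id"], example["label"])
```

Taking negatives from the retriever's own output, rather than sampling random documents, is what ties the reranker's training distribution to the candidates it scores at inference.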
Keywords/Search Tags: open-domain question answering, document retrieval, document reranking, negative sampling