| With the rapid development of the Internet, the volume of online data has grown exponentially. Unstructured text accounts for a large proportion of this data, while much data is also stored in semi-structured tables. Helping users quickly obtain useful information from this huge volume of heterogeneous tabular and textual data is therefore a very challenging problem. Question answering (QA) systems address this challenge by letting people ask questions in natural language and receive answers directly, which helps users quickly focus on key information amid the explosive growth of Internet data. QA systems can be divided into single-hop QA and multi-hop QA according to the depth of reasoning. Multi-hop QA requires multiple reasoning steps to find the answer, which better matches the complex QA situations encountered in real life. QA systems can also be divided into closed-domain (single corpus) and open-domain (massive corpus) according to the size of the corpus. In these two scenarios, existing QA systems still face the following problems. First, in the closed-domain scenario with a single corpus, most QA systems treat table data directly as text when handling heterogeneous tabular and textual data, losing the semi-structured information unique to tables. Second, in the open-domain scenario with a massive corpus, existing retrieval methods support the retrieval of heterogeneous tabular and textual data poorly, which leads to low precision and long retrieval times and hinders the subsequent multi-hop reasoning over the questions. To address these issues, this thesis proposes a complete multi-hop QA scheme for heterogeneous tabular and textual data that covers both closed-domain and open-domain scenarios. This thesis focuses on multi-hop QA tasks over heterogeneous tabular and textual data (called Table Text QA) and uses Table Text Meta Data, which consists of a
table row and its associated text, as the basic processing unit. The work is summarized as follows. (1) For the QA task over mixed tabular and textual data in closed-domain scenarios, this thesis proposes RHGN-CDTTQA, a Table Text QA algorithm based on a hierarchical graph neural network, which consists of two stages: row selection and row inference. The row selection stage first retrieves the evidence associated with the question and then ranks the table rows with a pre-trained model to select the row most likely to contain the answer. In the row inference stage, a hierarchical graph neural network for Table Text Meta Data is designed that constructs question, cell, and text nodes for inference, so as to capture the semi-structured information specific to the table and the associations between the table and the text. Experimental results show that the RHGN-CDTTQA algorithm can accurately locate answer positions in the given tabular and textual data. (2) For the QA task over mixed tabular and textual data in the open-domain scenario, this thesis proposes HR-ODTTQA, a Table Text QA algorithm based on hybrid retrieval, which consists of evidence linking, evidence retrieval, and evidence reasoning. It first uses generative methods to enrich the cell content and then uses dual encoders to match table cells with texts, constructing a Table Text Meta Data set. It then designs a retrieval model based on a dense retriever and proposes a negative sampling enhancement method to improve the discriminative ability of the retrieval model. In addition, it combines sparse retrieval with dense retrieval to reduce retrieval time while maintaining retrieval accuracy. On this basis, this thesis extends the closed-domain hierarchical graph neural network to multiple Table Text Meta Data units so that answers can be extracted from multiple metadata units. The experimental results show that the HR-ODTTQA
algorithm can quickly and accurately retrieve key evidence and locate answers in massive heterogeneous tabular and textual data. (3) Based on the above two algorithms, this thesis designs and implements a prototype multi-hop QA system for tabular and textual data. The prototype system consists mainly of an information extraction module, a data storage module, a UCL construction module, a question inference module, and a visual interaction module. It combines the RHGN-CDTTQA and HR-ODTTQA algorithms to provide closed-domain QA and open-domain QA respectively, while meeting users’ needs for information retrieval, data visualization, and data editing. Finally, the usability and robustness of the prototype system are verified through system testing. |
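
The hybrid retrieval idea in (2), a cheap sparse pass that narrows the candidate set followed by a dense pass that reranks the survivors, can be illustrated with a minimal Python sketch. This is not the HR-ODTTQA implementation. The names `MetaUnit`, `sparse_score`, `toy_encode`, and `hybrid_retrieve` are hypothetical. The term-overlap scorer stands in for a real sparse retriever, and the hashed bag-of-words encoder stands in for a trained dual encoder. The sample rows and passages are invented for illustration only.

```python
import math
import re
import zlib
from dataclasses import dataclass, field


@dataclass
class MetaUnit:
    """Hypothetical container for one unit: a table row plus its linked passages."""
    row: dict                                  # column header -> cell value
    passages: list = field(default_factory=list)

    def as_text(self):
        # Flatten headers, cells, and linked passages into one string for scoring.
        cells = " ".join(f"{header} {value}" for header, value in self.row.items())
        return " ".join([cells, *self.passages])


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def sparse_score(query, unit):
    """Count distinct query terms found in the unit (crude stand-in for a sparse retriever)."""
    return len(set(tokenize(query)) & set(tokenize(unit.as_text())))


def toy_encode(text, dim=256):
    """Hashed bag-of-words vector; a placeholder for a trained dense encoder."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query, units, prefilter_k=2, top_k=1, encode=toy_encode):
    """Stage 1: sparse prefilter keeps prefilter_k units. Stage 2: dense rerank of those."""
    candidates = sorted(units, key=lambda u: sparse_score(query, u), reverse=True)
    candidates = candidates[:prefilter_k]
    q_vec = encode(query)
    reranked = sorted(
        candidates,
        key=lambda u: cosine(q_vec, encode(u.as_text())),
        reverse=True,
    )
    return reranked[:top_k]


# Invented sample data, for illustration only.
units = [
    MetaUnit({"Player": "Alice Chen", "Team": "Rivertown"},
             ["Rivertown is a club founded in 1901"]),
    MetaUnit({"Player": "Bob Diaz", "Team": "Lakeside"},
             ["Lakeside plays in the northern league"]),
    MetaUnit({"Player": "Cara Evans", "Team": "Hillford"},
             ["Hillford was founded in 1950"]),
]
query = "When was the team of Alice Chen founded"
best = hybrid_retrieve(query, units, prefilter_k=2, top_k=1)
print(best[0].row)
```

Only the units that survive the sparse prefilter are encoded, which is where the time saving of a hybrid scheme comes from in this sketch. The `encode` parameter marks the place where a trained encoder would be plugged in.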