| Owing to their high relevance, tables are widely used in scientific literature to record data and provide a valuable reference for subsequent research. Aggregating table data from different publications into a single, standardised collection yields more comprehensive, accurate and systematic scientific and technical information, which provides data support for solving increasingly complex scientific problems. Deep learning-based cross-document table fusion extracts the semantic features of tables in order to extract and fuse their key-value pairs, making the integration of scientific and technical data more efficient. However, the dense text within a table makes text boundaries difficult to detect accurately. In addition, the same key is described differently in different publications, which makes table semantics difficult to extract in a small-sample setting, so table fusion still faces challenges. This paper addresses them as follows. (1) To avoid non-maximum suppression (NMS) operations and thereby improve the efficiency of table structure extraction, this paper proposes an image processing-based method for extracting table information. Firstly, RetinaNet is used to extract and fuse multi-scale features from document screenshots to obtain the table location. Secondly, a text recognition network, str-PG-Net, is proposed, in which the text skeleton is detected by morphological methods and a fully convolutional network is used to extract text centreline and text border features; furthermore, a binary classification neural network is used to obtain text orientation features, which are combined with the joint decoding of skeleton centroids to detect the text position accurately without NMS operations, improving the efficiency of table information extraction. Finally, a heuristic algorithm based on text spacing is used to detect cell positions and thus obtain the table structure. The experimental results show that the method proposed in this paper can improve the table information extraction efficiency
and optimise cell detection in the case of small samples. (2) To address the sparsity of table semantic data in a small-sample environment, this paper proposes a semantic model-based table fusion method. Specifically, character embeddings are trained to alleviate data sparsity, a bidirectional long short-term memory network (Bi-LSTM) is used to extract the semantic features of table keys, and softmax is used to identify the table keys. In addition, a table classification method based on title semantic features is proposed, which combines title-specific tables with table key semantic features, merges similar tables based on table key semantics, and stores the fused tables in a graph database. Experimental results show that the proposed method can identify unknown keys and improve the accuracy of table key classification. |
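
The fusion step in (2), merging tables from different publications once their keys have been mapped to a shared label, can be illustrated with a minimal Python sketch. It is not the method of the paper: the Bi-LSTM and softmax key classifier is replaced here by a hypothetical lookup table (`CANONICAL_KEYS`), the function names `classify_key` and `fuse_tables` are assumptions made for illustration, the sample values are invented, and the graph database storage is omitted.

```python
from collections import defaultdict

# Hypothetical stand-in for the Bi-LSTM + softmax key classifier:
# maps a lower-cased surface form of a key to a canonical key label.
CANONICAL_KEYS = {
    "temp": "temperature",
    "temperature (k)": "temperature",
    "t/k": "temperature",
    "pressure": "pressure",
    "p (mpa)": "pressure",
}


def classify_key(raw_key: str) -> str:
    """Return a canonical label for a raw table key, or 'unknown'."""
    return CANONICAL_KEYS.get(raw_key.strip().lower(), "unknown")


def fuse_tables(tables):
    """Merge key-value tables from several documents by canonical key.

    `tables` maps a document id to a dict of raw key -> value.
    Returns a dict of canonical key -> list of (document id, value) pairs.
    """
    fused = defaultdict(list)
    for doc_id, table in tables.items():
        for raw_key, value in table.items():
            fused[classify_key(raw_key)].append((doc_id, value))
    return dict(fused)


# Illustrative (invented) tables in which the same key is described differently.
tables = {
    "paper_a": {"Temp": 300, "Pressure": 1.0},
    "paper_b": {"T/K": 310, "P (MPa)": 1.2, "Yield": 0.8},
}
fused = fuse_tables(tables)
```

In this sketch, keys that the stand-in classifier does not recognise are grouped under `"unknown"` rather than discarded, so they remain available for later inspection.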