A document image is an image that contains document content, often including diverse page components and complex logical structures, which makes document layout analysis a highly challenging task. Specifically, layout analysis detects and classifies all elements on a page, such as tables, illustrations, formulas, and titles, by analyzing the document image. It is an important step toward automated analysis and recognition of document images. Compared with natural scene images, the visual contrast between foreground objects and background regions in document images is relatively small. Therefore, strengthening the feature extraction network's perception of key features and building a more powerful feature representation for layout analysis are critical to improving algorithm performance. Accordingly, this paper focuses on the following two aspects.

First, to address the problem that convolutional neural networks do not pay enough attention to key information when extracting image features, a document layout analysis algorithm based on a hybrid attention mechanism is proposed to build a more powerful feature representation from the input image. By integrating attention mechanisms into the feature extraction network, the network's emphasis on key features is enhanced and irrelevant information is suppressed in both the spatial and channel dimensions. In the spatial dimension, the attention mechanism guides the network toward key information areas by expanding the spatial range sampled by the convolutional layers and adjusting the positions at which they sample. The spatial offsets of the sampling positions bring in more contextual information and strengthen the connections between features. In the channel dimension, each channel of the feature map is biased toward certain features. The importance of each channel is therefore assessed from a global perspective, and channel features are selectively emphasized or weakened to improve the expression of the important ones. The information learned by these two attention mechanisms is embedded through lateral connections, improving the multiscale feature pyramid and enhancing the model's ability to extract image features. The algorithm achieves 94.3% mAP on the PubLayNet dataset, an effective improvement.

Second, a feature extraction network based on convolutional neural networks perceives local features well but is less capable of associating information across different spatial regions globally. To solve this problem, this paper presents a layout analysis algorithm based on a global self-attention mechanism. The image is partitioned into patches as input, and a Transformer-based backbone network is built to extract image features. The backbone adopts a four-stage hierarchical design that outputs cross-scale feature maps. The self-attention mechanism effectively connects information among regions at global scope, which helps to fully aggregate context information. In addition, a Mixed-MLP network, which is based on multi-layer perceptrons and on attention mechanisms in the spatial and channel dimensions, is integrated into the feature extraction network. The features output by the backbone are partitioned into patches and fed to the Mixed-MLP network, where channel information interacts within each patch while global spatial information is exchanged among all image patches through fully connected layers. The algorithm reaches 95.8% mAP on the PubLayNet dataset and achieves good layout analysis results. It also points to a direction for further improving the performance of layout analysis algorithms.
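To make the two channel-level ideas above more concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes a squeeze-and-excitation style gate for the channel attention (global average pooling, a small bottleneck, and one sigmoid gate per channel) and an MLP-Mixer style pair of linear maps for the Mixed-MLP step. The function names `channel_attention`, `matmul`, and `mixed_mlp`, the weight matrices, and the toy inputs are all hypothetical; nonlinearities between the mixing steps, normalization, and residual connections are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, w_reduce, w_expand):
    """Gate each channel of a C x H x W feature map stored as nested lists.

    w_reduce (R x C) and w_expand (C x R) stand in for learned weights
    (hypothetical values, chosen by hand in the example below).
    """
    # Global perspective: average each channel over all spatial positions.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Bottleneck with ReLU, then one sigmoid gate per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Emphasize or weaken every channel by multiplying it with its gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]
    return gates, scaled


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mixed_mlp(tokens, w_channel, w_spatial):
    """Mix a P x C token matrix (P patches, C channels) in two steps."""
    # (P x C) @ (C x C): channel interaction inside each patch.
    per_patch = matmul(tokens, w_channel)
    # (P x P) @ (P x C): fully connected exchange among all patches.
    return matmul(w_spatial, per_patch)


# Toy input: two constant 2 x 2 channels with different magnitudes.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[4.0, 4.0], [4.0, 4.0]]]
gates, scaled = channel_attention(feature_map, [[0.5, 0.5]], [[1.0], [-1.0]])

# Toy tokens: two patches with two channels each.
tokens = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
averaging = [[0.5, 0.5], [0.5, 0.5]]
mixed = mixed_mlp(tokens, identity, averaging)
```

With these hand-picked weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first is emphasized relative to the second. In the mixing example the identity channel weights leave each patch unchanged and the averaging spatial weights give every patch the mean of all patches, which illustrates how the spatial step shares information globally.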