The Transformer architecture, with the self-attention module at its core, has come to occupy an important position in computer vision. A Vision Transformer pre-trained on a large-scale dataset converges quickly to a good optimum on image classification and image segmentation tasks, and many Transformer-based architectures, once pre-trained weights are loaded, also achieve good results on downstream tasks, especially tasks with limited training data. These phenomena indicate a close connection between the self-attention module and model pre-training, yet research on this connection is still very limited. An in-depth analysis of the self-attention module and pre-training will help us better understand the essential mechanism of large pre-trained deep learning models and provide theoretical guidance for improving and applying the self-attention module.

This paper takes the ViT model as its research object and discusses the essential mechanism of the self-attention module during learning and its special properties after pre-training on large-scale datasets. Convolutional layers extract local features well but are weak at fusing long-distance features. The self-attention module first extracts preliminary image features, computes attention from these features, and then recombines the image features according to the attention, establishing strong interactions between different parts of the image and thus providing good long-range feature modeling capability. For the computation of the attention matrix, this paper analyzes, from the perspective of rank, how the learning process of the attention matrix affects the completeness of feature transfer, and uses eigenvalue theory to analyze the differences between the self-attention modules at different layers of ViT. On this basis, we characterize the essential properties of the self-attention structure and explore its general nature theoretically. In addition, by comparing with a randomly initialized ViT model, we further investigate the favorable properties of a self-attention module pre-trained on large-scale datasets. The results show that, once the self-attention-centered model has acquired strong feature extraction capability, it also exhibits better robustness and stability.

The main contributions of the paper are as follows:

(1) We analyze the relationship between the softmax function in the ViT model and image feature extraction. A well-pretrained ViT model filters out part of the information when computing the correlations between different parts of the image, which highlights the salient image features but also drives the correlation matrix into a low-rank state. If the correlation matrix remained low-rank throughout the subsequent learning process, image features would keep being lost and learning would fail. The ViT model relies on the softmax function to resolve this problem: we prove that softmax changes the correlations between vectors, and after softmax the correlation matrix rises to a state close to full rank, which largely avoids the loss of features. For a randomly initialized ViT model, however, the randomness of the initial parameters causes more image features to be lost during the correlation computation, so the correlation matrix remains low-rank even after softmax.
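To make the rank argument concrete, the following sketch (a synthetic NumPy illustration, not the experiments of this paper) builds a deliberately low-rank attention score matrix of the form QK^T / sqrt(d) and compares its numerical rank before and after the row-wise softmax; all dimensions and the subspace construction are assumptions chosen only for the demonstration.

```python
import numpy as np

def numerical_rank(m, tol=1e-6):
    """Rank estimated from singular values above a relative tolerance."""
    s = np.linalg.svd(m, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def row_softmax(m):
    """Row-wise softmax, as applied to the attention score matrix in ViT."""
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, r = 64, 32, 2                       # tokens, head dimension, synthetic rank

# Force Q and K into an r-dimensional subspace so the score matrix QK^T is low-rank,
# imitating a correlation matrix that has lost most of its information.
basis = rng.standard_normal((r, d))
Q = rng.standard_normal((n, r)) @ basis
K = rng.standard_normal((n, r)) @ basis
scores = Q @ K.T / np.sqrt(d)

print("rank before softmax:", numerical_rank(scores))               # low (= r)
print("rank after softmax: ", numerical_rank(row_softmax(scores)))  # substantially higher
```

On such synthetic data the pre-softmax rank equals the constructed rank r, while the post-softmax matrix is generically of much higher numerical rank, mirroring the behavior described above; the low-rank persistence observed for randomly initialized models is a statement about the learned Q and K of the actual network, not about this toy construction.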
(2) We analyze the iterative stability of the ViT model based on the eigenvalues of the attention matrix. In a well-trained ViT model, the eigenvalue distributions of the attention matrices at different depths differ markedly: in the shallow layers the eigenvalue moduli are spread over the interval [0, 1], while as the depth increases the eigenvalues of the attention matrix gradually tend to 0, which adjusts the feature weights according to depth and ensures the convergence of the learning process. Based on these eigenvalue distribution characteristics, the paper analyzes how the input data is iterated between layers and proves that the iterative process of the pre-trained ViT model is stable: as the layers deepen, the local stability of the corresponding layer increases accordingly, and the output data tends to converge.
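As an illustrative check of these eigenvalue claims (assuming the "attention matrix" refers to the row-stochastic post-softmax matrix, and assuming the Hugging Face transformers library with the google/vit-base-patch16-224-in21k checkpoint; a random tensor stands in for a preprocessed image), the sketch below reports how the eigenvalue moduli of the per-layer attention matrices behave with depth, and then verifies that repeatedly applying the deepest layer's attention matrix yields a convergent rather than divergent iteration.

```python
import numpy as np
import torch
from transformers import ViTModel

# Pretrained ViT-Base; the checkpoint name is an assumption made for illustration.
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
model.eval()

pixel_values = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    out = model(pixel_values, output_attentions=True)

# out.attentions holds one (batch, heads, tokens, tokens) tensor per layer; each row
# of an attention matrix is a softmax output, so the matrix is row-stochastic with
# strictly positive entries and spectral radius exactly 1.
attention_matrices = []
for depth, attn in enumerate(out.attentions, start=1):
    A = attn[0, 0].double().numpy()          # first head of this layer
    moduli = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    attention_matrices.append(A)
    print(f"layer {depth:2d}: |lambda_1| = {moduli[0]:.3f}, "
          f"|lambda_2| = {moduli[1]:.3e}, median modulus = {np.median(moduli):.3e}")

# Stability of the layer-wise iteration: because every non-dominant eigenvalue has
# modulus below 1, repeated application of the deepest attention matrix converges
# instead of diverging.
A = attention_matrices[-1]
x = np.random.default_rng(0).standard_normal(A.shape[0])
for k in range(1, 51):
    x_next = A @ x
    if k % 10 == 0:
        print(f"step {k:2d}: ||x_k - x_(k-1)|| = {np.linalg.norm(x_next - x):.3e}")
    x = x_next
```

Because each row of a post-softmax attention matrix is a probability distribution with strictly positive entries, the leading eigenvalue is always exactly 1; the depth-dependent decay described above therefore concerns the remaining eigenvalues, whose moduli govern how quickly the layer-wise iteration contracts.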