
Research on the Improvement and Application of the Transformer Method in Computer Vision

Posted on: 2024-07-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2568306917473304
Subject: Computer technology
Abstract/Summary:
With the rapid development of technology, some predict that the third industrial revolution will be driven by artificial intelligence. As research on the Internet and artificial intelligence deepens, researchers are increasingly eager to extract the necessary information from images. Traditional machine learning methods can no longer meet the practical needs of society and its downstream tasks, so researchers are no longer limited to convolutional neural networks and are actively exploring better visual models. It is in this context that the Transformer model, which has flourished in natural language processing, entered the field of computer vision, bringing new opportunities and possibilities to many visual tasks. Accordingly, this thesis focuses on the improvement and application of the Transformer method in computer vision and proposes three visual backbone models:

(1) Selective Patch Transformer for image classification. To address the inability of current vision Transformer models to extract and fuse multi-scale features, this thesis designs a Selective Patch Transformer (SepFormer) for image classification. SepFormer contains two important designs: the Patch Pyramid module, which helps the network obtain multi-scale features, and the selective scaling module, which adaptively assigns weights to branches of different scales based on the combined information of all branches, for feature fusion and enhancement. The model achieves significant performance gains at relatively low computational cost.

(2) Vision Transformer based on deformable windows. Previous Transformer methods restrict attention interaction to hand-crafted windows; such purely hand-designed, fixed-size windows limit the model's long-range dependency modeling and its ability to adapt to objects of different sizes, so distant target objects can only exchange information at deeper layers. Chapter 4 explores deformable windows as a way to restore the long-range dependencies of Transformer modeling. We propose the deformable-window Vision Transformer (DWT), which adaptively adjusts the position and scale of windows by learning from data, reducing the dependence on manual windows. The method is easy to use as a drop-in replacement: the adjusted windows focus more on the regions where target objects exist, capture more information from relevant regions, and strengthen information exchange between overlapping windows, significantly improving the overall performance of the model at a very small computational cost.

(3) Vision Transformer based on cross-scale attention fusion. This thesis proposes a Transformer visual backbone, abbreviated CVT, that extracts cross-scale features; applied to semantic segmentation, it generates stronger image features. On the one hand, we increase the length of the query vector (Q), fully exploiting the rich contextual information of adjacent positions; on the other hand, we design a simple and effective feature fusion method to obtain scale-aware semantics and construct powerful hierarchical features, which are crucial for dense prediction tasks. Sufficient experiments on semantic segmentation tasks confirm the method's effectiveness and strong performance.
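The selective scaling idea described for SepFormer can be sketched in a few lines: pool each scale branch globally, gate the pooled summaries into one softmax weight per branch, and fuse by weighted sum. This is a minimal NumPy sketch, not the thesis's exact module; the gate matrix `w_gate` stands in for whatever learned gating network SepFormer actually uses.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def selective_scale_fuse(branches, w_gate):
    """Fuse same-shaped multi-scale branch features (list of [N, C] arrays).

    Each branch is pooled globally; the pooled summaries are concatenated
    and passed through a gate (here a plain weight matrix, an assumed
    stand-in for a learned gating network) to produce one softmax weight
    per branch. The fused output is the weighted sum of the branches.
    """
    pooled = np.stack([b.mean(axis=0) for b in branches])   # [B, C] branch summaries
    logits = pooled.reshape(-1) @ w_gate                    # [B] one logit per branch
    weights = softmax(logits)                               # weights sum to 1
    fused = sum(w * b for w, b in zip(weights, branches))   # [N, C]
    return fused, weights
```

Because the weights come from the combined information of all branches rather than from each branch in isolation, a branch whose scale matches the input content can dominate the fusion adaptively, which matches the abstract's description of the module.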
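The deformable-window mechanism of DWT can likewise be illustrated with a toy sketch: partition the feature map into regular windows, predict a positional offset for each window from its own content, and re-sample the window at the shifted location before attention runs inside it. The offset head `w_off` below is a hypothetical single linear map on the window's mean feature, a stand-in for DWT's learned offset network, and scale adjustment is omitted for brevity.

```python
import numpy as np


def deformable_windows(feat, win, w_off):
    """Partition a [H, W, C] feature map into win x win windows, then shift
    each window by a content-predicted (dy, dx) offset before windowed
    attention would be applied.

    Offsets are clipped so shifted windows stay inside the feature map;
    a full implementation would also use bilinear sampling for fractional
    offsets and predict a per-window scale.
    """
    H, W, C = feat.shape
    windows = []
    for y in range(0, H, win):
        for x in range(0, W, win):
            patch = feat[y:y + win, x:x + win]
            dy, dx = patch.mean(axis=(0, 1)) @ w_off      # predicted shift
            ny = int(np.clip(y + dy, 0, H - win))          # keep window in bounds
            nx = int(np.clip(x + dx, 0, W - win))
            windows.append(feat[ny:ny + win, nx:nx + win])
    return windows
```

Because neighbouring windows may shift toward the same object, the sampled windows can overlap, which is one way to read the abstract's point about strengthened information exchange between overlapping windows.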
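One way to realize CVT's "lengthened query" is to concatenate each token with its edge-padded neighbours before the attention dot product, so that similarity is computed between local contexts rather than single tokens. The sketch below is an illustrative interpretation under that assumption, not the thesis's exact formulation: the keys are extended in the same way so the dot-product dimensions match, and learned projections are omitted.

```python
import numpy as np


def context_query_attention(x, neigh=1):
    """Self-attention over a [N, C] sequence where each query (and key) is
    lengthened by concatenating its +-neigh neighbours, so the similarity
    scores reflect the rich context around each position.
    """
    N, C = x.shape
    padded = np.pad(x, ((neigh, neigh), (0, 0)), mode="edge")
    # stack shifted copies to build [N, (2*neigh+1)*C] context vectors
    ctx = np.concatenate(
        [padded[i:i + N] for i in range(2 * neigh + 1)], axis=1
    )
    scores = ctx @ ctx.T / np.sqrt(ctx.shape[1])           # scaled dot product
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                # rows sum to 1
    return attn @ x                                        # values stay per-token
```

The values remain the original per-token features, so the output keeps shape [N, C]; only the matching step is context-aware.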
Keywords/Search Tags:Deep learning, Transformer, Attention mechanism, Convolutional neural network, Multiscale feature