In recent years, the development of information science and communication technology has led to rapid growth in speech, image, and video data, imposing new demands and challenges on multimedia computing research. Thanks to the growth in data volume and computation capacity, deep learning has advanced rapidly and has become one of the most prevalent technologies for signal processing and understanding. In contrast to conventional signal processing methods, neural-network-based methods are data-driven: they capture and exploit the statistical characteristics of input data via cascaded non-linear transformations to achieve intelligent representation, analysis, and understanding. In this process, the attention mechanism, one of the key technologies in the application and development of deep learning, simulates the complex cognitive functions of the human brain and enables neural networks to selectively receive and process information as humans do, which is of great importance for improving the ability of neural networks to represent, analyze, and understand large-scale data.

In this paper, we first categorize prior work on attention mechanisms into selective attention and self-attention, elaborate on their working mechanisms, and analyze their respective advantages and shortcomings. To address these shortcomings, we design more effective attention mechanisms from the perspectives of input data type, task demands, and task objectives. Furthermore, we study the application of different attention mechanisms in various scenarios and expand the application scope of attention mechanisms. Specifically, our model designs and experiments cover the following five aspects.

(1) We introduce the attention design applied to semantic understanding of images. In this part, we propose relation-aware global attention, which addresses the failure of existing image-based semantic understanding methods to exploit global-scope contextual information and capture feature relations. Specifically, we propose to determine the importance of features by comparison in the global scope, and design a spatial attention module and a channel attention module. By modeling relations along different dimensions in the global scope, we infer the importance (i.e., attention weights) of features from both the original feature and its corresponding relation feature via a trainable network, which strengthens task-beneficial information and suppresses task-irrelevant information.
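To make the design concrete, the following is a minimal PyTorch sketch of the spatial variant under one reading of the description above; the module name `SpatialRGA`, the embedding layers, the reduction ratio, and all layer sizes are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class SpatialRGA(nn.Module):
    """Sketch of spatial relation-aware global attention.
    Layer sizes are hypothetical, not the thesis's exact design."""

    def __init__(self, channels, height, width, reduction=8):
        super().__init__()
        n = height * width  # number of spatial positions
        # embeddings used to compute pairwise affinities between positions
        self.theta = nn.Conv2d(channels, channels // reduction, 1)
        self.phi = nn.Conv2d(channels, channels // reduction, 1)
        # compress the original feature to one channel per position
        self.embed = nn.Conv2d(channels, 1, 1)
        # trainable network that infers each position's attention weight
        # from [original feature, global relation feature]
        hidden = max(n // reduction, 1)
        self.weight_net = nn.Sequential(
            nn.Conv2d(1 + 2 * n, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w
        t = self.theta(x).flatten(2)            # (B, C', N)
        p = self.phi(x).flatten(2)              # (B, C', N)
        rel = torch.bmm(t.transpose(1, 2), p)   # (B, N, N) global pairwise relations
        # a position's relation feature stacks its affinities to and from
        # every other position in the global scope
        rel_feat = torch.cat([rel, rel.transpose(1, 2)], dim=2)        # (B, N, 2N)
        rel_feat = rel_feat.transpose(1, 2).reshape(b, 2 * n, h, w)
        a = self.weight_net(torch.cat([self.embed(x), rel_feat], dim=1))  # (B, 1, H, W)
        return x * a  # strengthen informative positions, suppress the rest

# usage on hypothetical 8x8 feature maps with 32 channels
x = torch.randn(2, 32, 8, 8)
y = SpatialRGA(channels=32, height=8, width=8)(x)
```

The channel attention module would follow the same pattern with relations modeled along the channel dimension instead of the spatial one.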
(2) We introduce the attention design applied to semantic understanding of videos. In this part, we propose a multi-granularity reference-aided attention mechanism, designed for effective feature extraction and fusion in video tasks. Video data contains content that appears only occasionally yet is novel, as well as content that is redundant along the temporal dimension. In addition, video content is dynamic, so its semantics commonly span multiple granularities. Tailored to these characteristics of video data, our proposed attention mechanism not only captures important semantics during feature extraction and fusion but also reduces the information redundancy of the extracted features as much as possible.

(3) We introduce the attention design applied to visual representation learning. In this part, we analyze and discuss the roles of selective attention and self-attention in learning visual representations. We design two different attention-based scalable image compression frameworks: one utilizes selective attention to perform hierarchical feature disentanglement, while the other utilizes self-attention to achieve feature decorrelation in the latent space. They improve the coding performance of learned scalable image compression from these two different perspectives.

(4) We introduce the attention design applied to selecting training samples. In this part, we expand the application range of attention mechanisms from deep features to the model inputs. We utilize the attention mechanism to select suitable training samples according to the current status of the neural network at different training stages. We further apply this design to develop more efficient reinforcement learning algorithms that reduce the dependence on large-scale training data.

(5) We introduce the attention design applied to the model outputs. In this part, we further expand the application scope of attention mechanisms by applying them to the model outputs for better model optimization. Intuitively, a neural network has different uncertainties on its outputs for different inputs, and samples of high uncertainty commonly introduce optimization biases that degrade model performance. This negative effect is particularly prominent when training data is limited, i.e., in few-shot learning. We therefore model the heterologous uncertainty in the few-shot image classification task and propose attention-based uncertainty-aware optimization. Specifically, we utilize attention to modulate the model outputs that have different uncertainties, thereby alleviating the side effects of uncertainty and improving the effectiveness of model optimization.
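As an illustration of modulating model outputs by their uncertainty, the following is a minimal PyTorch sketch. The abstract does not give the exact formulation, so this substitutes a standard heteroscedastic weighting in the style of Kendall and Gal (2017); the uncertainty head `log_var_head` and all names and shapes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyModulatedLoss(nn.Module):
    """Illustrative sketch, not the thesis's exact method: a small head
    estimates a per-sample log-variance from the model outputs, and an
    attention-like weight exp(-log_var) attenuates the loss contribution
    of high-uncertainty samples."""

    def __init__(self, num_classes):
        super().__init__()
        self.log_var_head = nn.Linear(num_classes, 1)  # hypothetical uncertainty head

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample loss
        log_var = self.log_var_head(logits).squeeze(-1)          # per-sample log-variance
        # down-weight uncertain samples; the +log_var term penalizes the
        # trivial solution of predicting unbounded uncertainty
        return (torch.exp(-log_var) * ce + log_var).mean()

# usage on a toy 5-way classification batch (hypothetical shapes)
logits = torch.randn(4, 5)
targets = torch.randint(0, 5, (4,))
loss = UncertaintyModulatedLoss(num_classes=5)(logits, targets)
```

Under this weighting, high-uncertainty samples contribute smaller gradients, which matches the stated goal of reducing the optimization bias they would otherwise introduce in few-shot settings.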