Visual anomaly detection aims to identify abnormal samples that do not conform to the frequently observed normal samples. Specifically, it classifies visual data into normal and abnormal classes by learning the patterns present in normal samples. Visual anomaly detection has wide applications in many fields, ranging from industrial defect detection to medical image diagnosis and intelligent video surveillance, and is therefore of significant research and practical value. Unlike the traditional classification problem, the key challenges of visual anomaly detection lie in the scarcity, ambiguity, and diversity of anomalies. Specifically, abnormal events occur far less frequently than normal events, resulting in an imbalance between normal and abnormal data. Moreover, the identification of an anomaly depends on the scene in which it occurs, and abnormal events exhibit large intra-class differences, varying in motion dynamics, camera angles, and scales. Consequently, the key scientific problem in visual anomaly detection is how to capture and represent the discriminative semantic information of the normal and abnormal categories from imbalanced training data. In recent years, methods based on deep neural networks have achieved remarkable performance improvements on visual anomaly detection, but they still have notable limitations. To address the above challenges, this dissertation studies deep neural network-based visual anomaly detection from four aspects: mining discriminative semantic representations, enhancing intra-class representation consistency, enlarging the inter-class margin, and modeling the spatio-temporal interactions among individuals. The major contributions of this dissertation are as follows:

To address the problem that existing methods hardly capture the
discriminative semantic features of images, we propose an image-level visual anomaly detection method based on self-supervised representation learning. The proposed method embeds a self-supervised representation learning task into a deep autoencoder-based image anomaly detection framework. Specifically, the self-supervised task, autoencoding transformation, is embedded into the deep autoencoder to form an information gap between the input and the supervision signal of the model, encouraging the model to capture the discriminative semantic features of images. Meanwhile, the self-supervised task hinders the transformation reconstruction of abnormal images, which further improves the model's ability to recognize them. In addition, the proposed model is optimized by maximizing a lower bound on the joint mutual information between the learned representations and the model input, which ensures that the learned representations contain adequate information for reconstructing both the applied transformation and the original image. Finally, experimental results on multiple datasets demonstrate the effectiveness and superiority of the proposed method.

To overcome the deficiencies caused by the diversity of anomalies, we propose a pixel-level visual anomaly detection method based on prototype learning and uncertainty estimation. First, a memory-guided prototypical transformer encoder is designed to learn and memorize the prototypical representations of the various anomalies in the training data. The input features are then reconstructed as a weighted sum of the prototypical representations stored in the memory module. In this way, the model yields semantically consistent representations for anomalies of the same type while capturing the diversity of anomalies. Furthermore, an anomaly detection uncertainty quantizer based on Bayesian deep learning is designed to learn the probability distribution over detection results and estimate the anomaly uncertainty of each pixel. Finally, a prototype- and uncertainty-guided transformer
decoder is designed to infer anomalies with the assistance of the anomaly prototypes and uncertainties. As a result, the proposed model captures both the diversity and the ambiguity of anomalies and achieves remarkable performance improvements.

To tackle the low discrimination of existing deep anomaly detection models caused by the ambiguity of anomalies, we propose a novel video-level visual anomaly detection approach based on self-supervised attentive generative adversarial networks. The proposed approach improves on existing deep anomaly detection models by enlarging the gap in anomaly scores between normal and abnormal frames from two perspectives. On the one hand, we insert a self-attention mechanism into the generative adversarial networks to capture long-range contextual information, improving the prediction quality of normal frames. On the other hand, we add a self-supervised discriminator that carries out a self-supervised task; it enlarges the gap in anomaly scores between normal and abnormal frames by limiting the model's generalization to abnormal frames. Finally, the proposed method uses the errors of video prediction and of the self-supervised task to compute the anomaly score of each frame. Experimental results on multiple video anomaly detection datasets demonstrate the effectiveness and superiority of the proposed method.

To address the problem that existing methods ignore the spatio-temporal interactions between objects in videos, we propose a hierarchical graph embedding-based framework built on a spatio-temporal transformer, which leverages the strength of graph representations in encoding strongly structured skeleton features. Specifically, the input skeleton features are organized into a hierarchical spatio-temporal graph. Each node in the global graph encodes the speed of an object as well as the relative positions of and interactions between objects. In contrast, the local graph encodes the pose of each object independently and can only be used
to identify the abnormal behavior of a single object. Furthermore, a novel task-specific spatio-temporal graph transformer is designed to encode the hierarchical spatio-temporal graph embeddings of the skeletons and to learn the regular patterns in normal training videos through motion prediction. In this way, the proposed model exploits both the global and the local information of skeletons to improve the accuracy and robustness of anomaly detection. Experimental results on multiple video anomaly detection datasets indicate that the proposed method outperforms existing approaches.
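The memory-guided reconstruction at the core of the second contribution, expressing each input feature as a weighted sum of stored prototypes, can be sketched in NumPy as follows. This is an illustrative sketch rather than the dissertation's implementation; the cosine-similarity addressing and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_reconstruct(features, memory):
    """Reconstruct each input feature as a weighted sum of memory prototypes.

    features: (N, D) input feature vectors.
    memory:   (M, D) learned prototype slots.
    Returns the (N, D) reconstructions and the (N, M) addressing weights.
    """
    # Cosine-similarity addressing between features and prototypes
    # (an assumed choice; any similarity measure would fit the same scheme).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    weights = softmax(f @ m.T, axis=1)   # each row sums to 1
    return weights @ memory, weights
```

Because every reconstruction lies in the span of the prototypes, features of the same anomaly type are pulled toward the same memory slots, which is what yields semantically consistent representations.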
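The third contribution's frame-level scoring, fusing the video-prediction error with the self-supervised task's error, can be sketched like this. The per-video min-max normalization and the equal default weighting are assumptions made for illustration; the dissertation does not specify the exact fusion here.

```python
import numpy as np

def min_max_normalize(errors):
    # Map the per-frame errors of one video into [0, 1].
    e = np.asarray(errors, dtype=float)
    return (e - e.min()) / (e.max() - e.min() + 1e-8)

def frame_anomaly_scores(pred_errors, ssl_errors, weight=0.5):
    """Fuse per-frame prediction errors and self-supervised task errors.

    Both error sequences are normalized per video, then combined with a
    balancing weight (the 0.5 default is an assumed value). Higher scores
    indicate more anomalous frames.
    """
    p = min_max_normalize(pred_errors)
    s = min_max_normalize(ssl_errors)
    return weight * p + (1.0 - weight) * s
```

A frame that is poorly predicted and also fails the self-supervised task receives a high fused score, which is exactly the enlarged normal/abnormal gap the approach targets.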
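For the fourth contribution, the global-graph node attributes described above (per-object speed and pairwise relative positions) can be derived from tracked object centers as below. The (T, N, 2) layout and the function name are illustrative assumptions.

```python
import numpy as np

def global_graph_features(centers):
    """Compute global-graph attributes from tracked object centers.

    centers: (T, N, 2) array of 2-D centers of N objects over T frames.
    Returns:
      speed: (T-1, N) per-object displacement magnitude between frames.
      rel:   (T, N, N, 2) pairwise relative positions between objects.
    """
    # Speed as the frame-to-frame displacement magnitude of each object.
    speed = np.linalg.norm(np.diff(centers, axis=0), axis=-1)
    # Relative position of object j as seen from object i, per frame.
    rel = centers[:, :, None, :] - centers[:, None, :, :]
    return speed, rel
```

The relative positions are antisymmetric (rel[t, i, j] == -rel[t, j, i]), so an undirected edge between two objects can reuse a single computed vector.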