Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

1 Department of Computer Science and Engineering, Southeast University, Nanjing, China
2 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications
3 New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA)
4 School of Artificial Intelligence, University of Chinese Academy of Sciences

Motivation


Illustration of the data. (a) Training and testing data: the training data consists of seen normal motions, while the testing data contains both unseen normal and abnormal motions. Although seen and unseen motions represent the same action (e.g., walking), their local details, such as stride length, arm-swing amplitude, and joint angles, differ significantly. (b) Frequency analysis of motions. The analysis shows that a motion retaining only 70% of its low-frequency information remains largely similar to the original in global structure, with only minor differences in the low-frequency regions. Note that low-frequency and high-frequency regions do not correspond directly to specific joints. Instead, high-frequency regions are defined as areas where the joints, while still predominantly containing low-frequency information, exhibit a relatively higher proportion of high-frequency details.
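
As a self-contained sketch of the frequency analysis in (b), the Python snippet below keeps only the low-frequency 2D DCT coefficients of a motion matrix and then inverts the transform. The motion shape, the block-shaped mask, and the reading of "70% of the low-frequency information" as a coefficient-block size are illustrative assumptions rather than the paper's exact protocol.

# Hedged sketch: keep only the low-frequency part of a skeleton motion.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
motion = rng.standard_normal((24, 34))   # e.g., 24 frames x (17 joints x 2 coords), an assumed shape

coeffs = dctn(motion, norm="ortho")      # 2D DCT of the motion matrix

# Keep the top-left (low-frequency) block of coefficients and zero the rest.
keep_ratio = 0.7                         # stands in for "70% low-frequency information"
rows = int(coeffs.shape[0] * keep_ratio)
cols = int(coeffs.shape[1] * keep_ratio)
mask = np.zeros_like(coeffs)
mask[:rows, :cols] = 1.0

low_freq_motion = idctn(coeffs * mask, norm="ortho")

# The global structure is largely preserved; the residual holds the local details.
print("mean reconstruction error:", np.abs(motion - low_freq_motion).mean())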


Method


Comparison between our proposed method (green) and existing methods (blue). During the training phase, we employ adversarial training of the perturbation generator and the denoiser to enhance model robustness. Specifically, the perturbation generator attacks the observed motion, producing motions that are challenging to reconstruct yet still resemble normal motions. These perturbed motions are then used to train the denoiser, thereby improving its robustness. During the inference phase, we apply the DCT to separate the observed motion into global and local components, represented as low-frequency and high-frequency information. By leveraging the high-frequency information as guidance, our method reconstructs the observed motion more accurately than existing methods.
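
The snippet below is a minimal sketch of the perturbation-generator idea described above: a small network outputs a bounded residual that is added to the observed motion, so the result stays close to normal motions while being trained adversarially to be hard to reconstruct. The class name, network size, and epsilon bound are illustrative assumptions and do not reflect the paper's exact architecture.

# Hedged sketch of a perturbation generator producing bounded residuals.
import torch
import torch.nn as nn

class PerturbationGenerator(nn.Module):
    """Produces a small, bounded residual that is added to the observed motion."""
    def __init__(self, dim, epsilon=0.05):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))
        self.epsilon = epsilon                 # keeps perturbed motions close to normal ones

    def forward(self, motion):
        delta = self.epsilon * torch.tanh(self.net(motion))
        return motion + delta                  # perturbed motion that still resembles the original

# Adversarial objective (conceptually): the generator maximizes the denoiser's
# reconstruction error on the perturbed motion, while the denoiser minimizes it.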

Abstract

Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision, commonly tackled through reconstruction-based methods. However, these methods struggle with two key limitations: (1) insufficient robustness in open-set scenarios, where unseen normal motions are frequently misclassified as anomalies, and (2) an overemphasis on, yet limited capacity for, reconstructing local motion details, which are inherently difficult to capture accurately due to their diversity. To overcome these challenges, we introduce a novel frequency-guided diffusion model with perturbation training. First, we enhance robustness by training a generator to produce perturbed samples, which are similar to normal samples and target the weaknesses of the reconstruction model. This training paradigm expands the reconstruction domain of the model, improving its generalization to unseen normal motions. Second, to address the overemphasis on motion details, we employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components. By guiding the diffusion model with observed high-frequency information, we prioritize the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. Extensive experiments on five widely used VAD datasets demonstrate that our approach surpasses state-of-the-art methods, underscoring its effectiveness in open-set scenarios and diverse motion contexts.
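
To make the frequency-guided reconstruction concrete, here is a hedged sketch of how the observed high-frequency information could be fused with the generated low-frequency information via a 2D DCT mask. The block-shaped mask, the function name fuse_frequencies, and the threshold lam_dct are illustrative assumptions, not the paper's exact implementation.

# Hedged sketch: low-frequency content from the generated motion,
# high-frequency content from the observed motion.
import numpy as np
from scipy.fft import dctn, idctn

def fuse_frequencies(observed, generated, lam_dct=0.5):
    """Replace the high-frequency DCT coefficients of the generated motion
    with those of the observed motion."""
    obs_c = dctn(observed, norm="ortho")
    gen_c = dctn(generated, norm="ortho")
    low = np.zeros_like(obs_c)
    r = int(obs_c.shape[0] * lam_dct)
    c = int(obs_c.shape[1] * lam_dct)
    low[:r, :c] = 1.0                              # 1 marks the low-frequency region
    fused = gen_c * low + obs_c * (1.0 - low)      # low from generated, high from observed
    return idctn(fused, norm="ortho")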


Framework


The framework of the proposed method. The model is trained using generated perturbation examples. The training phase includes two processes: minimizing the mean squared error to train the noise predictor and maximizing this error to train the perturbation generator. During the testing phase, the high-frequency information of observed motions and the low-frequency information of generated motions are fused for effective anomaly detection.
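
The following is a minimal sketch of the alternating objectives in the framework figure, assuming a generic diffusion interface: one step minimizes the noise-prediction MSE to train the denoiser, and one step maximizes the same error to train the perturbation generator. The helpers diffusion.sample_timesteps and diffusion.add_noise, as well as the optimizer setup, are assumed for illustration.

# Hedged sketch of the alternating min-max training step (interfaces assumed).
import torch
import torch.nn.functional as F

def training_step(denoiser, generator, opt_d, opt_g, motion, diffusion):
    # 1) Train the noise predictor on perturbed motions (minimize the MSE).
    perturbed = generator(motion).detach()
    t = diffusion.sample_timesteps(motion.shape[0])    # assumed helper
    noisy, noise = diffusion.add_noise(perturbed, t)   # assumed helper
    loss_d = F.mse_loss(denoiser(noisy, t), noise)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the perturbation generator (maximize the same MSE).
    perturbed = generator(motion)
    noisy, noise = diffusion.add_noise(perturbed, t)
    loss_g = -F.mse_loss(denoiser(noisy, t), noise)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), -loss_g.item()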

Illustration of Perturbation Training


Illustration of perturbation training. In Fig. (a), the green and yellow points denote the original training motion $x_k$ and the perturbed motion $\hat{x}_k$, respectively. The red region represents the distribution of unseen normal samples. Accordingly, Fig. (b) shows that the reconstruction domain is extended by our proposed perturbation training.

Visualizations

The left videos show the frame-level ground-truth labels, and the right ones show the detection results.

Experiment Results


Comparison of the proposed method against other SoTA methods. The best results across all methods are in bold, the second-best ones are underlined, and the superscript ‡ denotes the best performance among the methods under each paradigm.


Robustness analysis of perturbation training. "PT" denotes perturbation training; "$\lambda_{PI}$" represents the perturbation intensity during inference.


Sensitivity analysis of the DCT-mask threshold $\lambda_\text{dct}$.


Anomaly score curves on the Avenue and HR-UBnormal datasets. (a) Avenue dataset; (b) HR-UBnormal dataset. The horizontal axis represents the frame index; the red circles in each clip mark abnormal events, and the green circles mark normal ones.
