Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection

1 Department of Computer Science and Engineering, Southeast University, Nanjing, China
2 Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications
3 New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA)
4 School of Artificial Intelligence, University of Chinese Academy of Sciences

Motivation


Illustration of the data. (a) Training and testing data: the training data consists of seen normal motions, while the testing data contains both unseen normal and abnormal motions. Although seen and unseen motions represent the same action (e.g., walking), their local details, such as stride length, arm-swing amplitude, and joint angles, differ significantly. (b) Frequency analysis of motions. The analysis shows that a motion retaining only 70% of its low-frequency information remains largely similar to the original in global structure, with only minor differences in the low-frequency regions. Note that low-frequency and high-frequency regions do not correspond directly to specific joints. Instead, high-frequency regions are defined as areas where the joints, while still predominantly containing low-frequency information, exhibit a relatively higher proportion of high-frequency details.
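
As a self-contained sketch of the frequency analysis in (b), the Python snippet below keeps only the low-frequency 2D DCT coefficients of a motion matrix and then inverts the transform. The motion shape, the block-shaped mask, and the reading of "70% of the low-frequency information" as a coefficient-block size are illustrative assumptions rather than the paper's exact protocol.

# Hedged sketch: keep only the low-frequency part of a skeleton motion.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
motion = rng.standard_normal((24, 34))   # e.g., 24 frames x (17 joints x 2 coords), an assumed shape

coeffs = dctn(motion, norm="ortho")      # 2D DCT of the motion matrix

# Keep the top-left (low-frequency) block of coefficients and zero the rest.
keep_ratio = 0.7                         # stands in for "70% low-frequency information"
rows = int(coeffs.shape[0] * keep_ratio)
cols = int(coeffs.shape[1] * keep_ratio)
mask = np.zeros_like(coeffs)
mask[:rows, :cols] = 1.0

low_freq_motion = idctn(coeffs * mask, norm="ortho")

# The global structure is largely preserved; the residual holds the local details.
print("mean reconstruction error:", np.abs(motion - low_freq_motion).mean())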


Method


Comparison between our proposed method (green) and existing methods (blue). During the training phase, we employ adversarial training of the perturbation generator and the denoiser to enhance model robustness. Specifically, the perturbation generator attacks the observed motion, producing motions that are challenging to reconstruct yet still resemble normal motions. These perturbed motions are then used to train the denoiser, thereby improving its robustness. During the inference phase, we apply the DCT to separate the observed motion into global and local components, represented as low-frequency and high-frequency information. By leveraging the high-frequency information as guidance, our method reconstructs the observed motion more accurately than existing methods.
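
The snippet below is a minimal sketch of the perturbation-generator idea described above: a small network outputs a bounded residual that is added to the observed motion, so the result stays close to normal motions while being trained adversarially to be hard to reconstruct. The class name, network size, and epsilon bound are illustrative assumptions and do not reflect the paper's exact architecture.

# Hedged sketch of a perturbation generator producing bounded residuals.
import torch
import torch.nn as nn

class PerturbationGenerator(nn.Module):
    """Produces a small, bounded residual that is added to the observed motion."""
    def __init__(self, dim, epsilon=0.05):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, dim))
        self.epsilon = epsilon                 # keeps perturbed motions close to normal ones

    def forward(self, motion):
        delta = self.epsilon * torch.tanh(self.net(motion))
        return motion + delta                  # perturbed motion that still resembles the original

# Adversarial objective (conceptually): the generator maximizes the denoiser's
# reconstruction error on the perturbed motion, while the denoiser minimizes it.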

Abstract

Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision, commonly tackled through reconstruction-based methods. However, these methods struggle with two key limitations: (1) insufficient robustness in open-set scenarios, where unseen normal motions are frequently misclassified as anomalies, and (2) an overemphasis on, yet limited capacity for, reconstructing local motion details, which are inherently difficult to capture accurately due to their diversity. To overcome these challenges, we introduce a novel frequency-guided diffusion model with perturbation training. First, we enhance robustness by training a generator to produce perturbed samples, which are similar to normal samples and target the weaknesses of the reconstruction model. This training paradigm expands the reconstruction domain of the model, improving its generalization to unseen normal motions. Second, to address the overemphasis on motion details, we employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components. By guiding the diffusion model with observed high-frequency information, we prioritize the reconstruction of low-frequency components, enabling more accurate and robust anomaly detection. Extensive experiments on five widely used VAD datasets demonstrate that our approach surpasses state-of-the-art methods, underscoring its effectiveness in open-set scenarios and diverse motion contexts.
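
To make the frequency-guided reconstruction concrete, here is a hedged sketch of how the observed high-frequency information could be fused with the generated low-frequency information via a 2D DCT mask. The block-shaped mask, the function name fuse_frequencies, and the threshold lam_dct are illustrative assumptions, not the paper's exact implementation.

# Hedged sketch: low-frequency content from the generated motion,
# high-frequency content from the observed motion.
import numpy as np
from scipy.fft import dctn, idctn

def fuse_frequencies(observed, generated, lam_dct=0.5):
    """Replace the high-frequency DCT coefficients of the generated motion
    with those of the observed motion."""
    obs_c = dctn(observed, norm="ortho")
    gen_c = dctn(generated, norm="ortho")
    low = np.zeros_like(obs_c)
    r = int(obs_c.shape[0] * lam_dct)
    c = int(obs_c.shape[1] * lam_dct)
    low[:r, :c] = 1.0                              # 1 marks the low-frequency region
    fused = gen_c * low + obs_c * (1.0 - low)      # low from generated, high from observed
    return idctn(fused, norm="ortho")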


Framework


The framework of the proposed method. The model is trained using generated perturbation examples. The training phase includes two processes: minimizing the mean squared error to train the noise predictor and maximizing this error to train the perturbation generator. During the testing phase, the high-frequency information of observed motions and the low-frequency information of generated motions are fused for effective anomaly detection.
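
The following is a minimal sketch of the alternating objectives in the framework figure, assuming a generic diffusion interface: one step minimizes the noise-prediction MSE to train the denoiser, and one step maximizes the same error to train the perturbation generator. The helpers diffusion.sample_timesteps and diffusion.add_noise, as well as the optimizer setup, are assumed for illustration.

# Hedged sketch of the alternating min-max training step (interfaces assumed).
import torch
import torch.nn.functional as F

def training_step(denoiser, generator, opt_d, opt_g, motion, diffusion):
    # 1) Train the noise predictor on perturbed motions (minimize the MSE).
    perturbed = generator(motion).detach()
    t = diffusion.sample_timesteps(motion.shape[0])    # assumed helper
    noisy, noise = diffusion.add_noise(perturbed, t)   # assumed helper
    loss_d = F.mse_loss(denoiser(noisy, t), noise)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Train the perturbation generator (maximize the same MSE).
    perturbed = generator(motion)
    noisy, noise = diffusion.add_noise(perturbed, t)
    loss_g = -F.mse_loss(denoiser(noisy, t), noise)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), -loss_g.item()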

Illustration of Perturbation Training


Illustration of perturbation training. In Fig. (a), the green and yellow points denote the original training motion $x_k$ and the perturbed motion $\hat{x}_k$, respectively. The red region represents the distribution of unseen normal samples. Accordingly, Fig. (b) shows that the reconstruction domain is extended by our proposed perturbation training.

Visualizations

The left videos show the frame-level ground-truth labels, and the right ones show the detection results.

Experiment Results


Comparison of the proposed method against other SoTA methods. The best results across all methods are in bold, the second-best ones are underlined, and the superscript ‡ denotes the best performance among the methods under each paradigm.


Robustness analysis of perturbation training. "PT" denotes perturbation training; "$\lambda_{PI}$" represents the perturbation intensity during inference.


Sensitivity analysis of the DCT-mask threshold $\lambda_\text{dct}$.


Anomaly score curves on the Avenue and HR-UBnormal datasets. (a) Avenue dataset; (b) HR-UBnormal dataset. The horizontal axis represents the frame index; the red circles in each clip mark abnormal events, and the green circles mark normal ones.
