Weakly-Supervised Video Anomaly Detection (WVAD) aims to identify anomalous events using only video-level labels, balancing annotation efficiency with practical applicability. However, existing methods often oversimplify the anomaly space by treating all abnormal events as a single category, overlooking the diverse semantic and temporal characteristics intrinsic to real-world anomalies. Inspired by how humans perceive anomalies, namely by jointly interpreting the temporal motion patterns and semantic structures underlying different anomaly types, we propose RefineVAD, a novel framework that mimics this dual-process reasoning. Our framework integrates two core modules. The first, Motion-aware Temporal Attention and Recalibration (MoTAR), estimates motion salience and dynamically adjusts temporal focus via shift-based attention and global Transformer-based modeling. The second, Category-Oriented Refinement (CORE), injects soft anomaly-category priors into the representation space by aligning segment-level features with learnable category prototypes through cross-attention. By jointly leveraging temporal dynamics and semantic structure, RefineVAD explicitly models both "how" motion evolves and "what" semantic category it resembles. Extensive experiments on WVAD benchmarks validate the effectiveness of RefineVAD and highlight the importance of integrating semantic context to guide feature refinement toward anomaly-relevant patterns.
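To make the MoTAR description concrete, the following is a minimal sketch, not the paper's implementation: it assumes segment-level features of shape (batch, segments, dim), approximates motion salience by first-order feature differences, uses a salience-weighted temporal shift as a stand-in for shift-based attention, and applies a standard Transformer encoder for global temporal modeling. The class name `MoTARSketch` and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact MoTAR module):
# motion-aware temporal attention and recalibration over segment features.
import torch
import torch.nn as nn

class MoTARSketch(nn.Module):
    def __init__(self, dim=1024, heads=4):
        super().__init__()
        # Motion-salience head: maps per-segment motion cues to a 0..1 weight.
        self.salience = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        # Global temporal modeling with a single Transformer encoder layer.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=1,
        )

    def forward(self, x):            # x: (batch, segments, dim) segment features
        # Motion salience from first-order temporal differences (assumption).
        diff = torch.zeros_like(x)
        diff[:, 1:] = x[:, 1:] - x[:, :-1]
        s = self.salience(diff)       # (batch, segments, 1) motion weights
        # Shift-based mixing: blend each segment with its predecessor,
        # scaled by motion salience (stand-in for shift-based attention).
        shifted = torch.roll(x, shifts=1, dims=1)
        x = x + s * shifted
        # Recalibrate with global Transformer context.
        return self.encoder(x)
```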
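Similarly, a minimal sketch of the CORE idea, again under stated assumptions rather than the paper's implementation: learnable category prototypes serve as keys and values in a cross-attention where segment features act as queries, so the attention weights play the role of soft anomaly-category priors. The class name `CORESketch` and `num_categories=13` (e.g., the number of anomaly classes in a benchmark such as UCF-Crime) are assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact CORE module):
# category-oriented refinement via cross-attention with learnable prototypes.
import torch
import torch.nn as nn

class CORESketch(nn.Module):
    def __init__(self, dim=1024, num_categories=13, heads=4):
        super().__init__()
        # One learnable prototype per assumed anomaly category.
        self.prototypes = nn.Parameter(torch.randn(num_categories, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, segments, dim)
        protos = self.prototypes.unsqueeze(0).expand(x.size(0), -1, -1)
        # Segments query the category prototypes; the attention weights act as
        # a soft category assignment for each segment.
        refined, soft_priors = self.cross_attn(query=x, key=protos, value=protos)
        # Residual refinement of segment features with category-aware context.
        return self.norm(x + refined), soft_priors
```

In this sketch the prototypes are trained end-to-end with the detector, so category semantics emerge from video-level supervision alone; how the soft priors are consumed downstream (e.g., in the anomaly scoring head) is left open here.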