Video anomaly detection (VAD) is a critical yet challenging task because real-world scenarios are complex and diverse. Previous methods typically rely on domain-specific training data and manual adjustment when applied to new scenarios and unseen anomaly types, and therefore suffer from high labor costs and limited generalization. We instead aim for generalist VAD, i.e., automatically handling any scene and any anomaly type without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on multimodal large language models (MLLMs). Specifically, we build PANDA around four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware retrieval-augmented generation (RAG) mechanism that lets PANDA retrieve anomaly-specific knowledge for planning its detection strategy. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine its decisions in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experience for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex-scenario settings without training or manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
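To make the control flow behind the four capabilities concrete, the following is a minimal sketch of a plan, reason, reflect, and remember loop. It is an illustration under stated assumptions, not PANDA's released implementation: every name here (`Memory`, `plan`, `toy_score`, `deblur`, `detect`) and the thresholds `low` and `high` are hypothetical, clips are plain text strings standing in for video, and `toy_score` is a rule-based stand-in for an MLLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Bounded log of past decisions (a stand-in for a chain-of-memory)."""
    entries: List[str] = field(default_factory=list)
    max_len: int = 5

    def add(self, note: str) -> None:
        # Keep only the most recent `max_len` notes.
        self.entries = (self.entries + [note])[-self.max_len:]


def plan(scene: str, knowledge: Dict[str, List[str]]) -> List[str]:
    """Stage 1: look up candidate anomaly types for the scene (toy retrieval)."""
    return knowledge.get(scene, ["unusual motion"])


def toy_score(clip: str, candidates: List[str], history: List[str]) -> float:
    """Stage 2: hypothetical scorer standing in for a prompted MLLM.

    `history` is accepted to show where memory would be consumed,
    but this toy version ignores it.
    """
    if "blurry" in clip:
        return 0.5  # ambiguous evidence
    return 0.9 if any(c in clip for c in candidates) else 0.1


def deblur(clip: str) -> str:
    """Hypothetical tool: pretend to enhance the clip by dropping the 'blurry' tag."""
    return clip.replace("blurry ", "")


def detect(
    clips: List[str],
    scene: str,
    knowledge: Dict[str, List[str]],
    score_fn: Callable[[str, List[str], List[str]], float],
    tools: List[Callable[[str], str]],
    low: float = 0.3,
    high: float = 0.7,
) -> List[bool]:
    memory = Memory()
    candidates = plan(scene, knowledge)
    labels = []
    for clip in clips:
        score = score_fn(clip, candidates, memory.entries)
        # Stage 3: while the score is ambiguous, apply the next tool and re-score.
        for tool in tools:
            if not (low < score < high):
                break
            clip = tool(clip)
            score = score_fn(clip, candidates, memory.entries)
        is_anomalous = score >= 0.5
        # Stage 4: record the decision so later clips can see it.
        memory.add(f"{'anomalous' if is_anomalous else 'normal'}:{score:.2f}")
        labels.append(is_anomalous)
    return labels


knowledge = {"street": ["fighting", "robbery"]}
clips = ["people walking", "blurry fighting", "blurry people walking"]
labels = detect(clips, "street", knowledge, toy_score, [deblur])
print(labels)  # [False, True, False]
```

In this sketch the reflection step runs only when the first-pass score falls strictly between `low` and `high`, so confident decisions skip the tool calls entirely. That is one plausible way to read "progressive reflection", chosen here for simplicity.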