Although deep neural networks (DNNs) have enabled great progress in video abnormal event detection (VAD), existing solutions typically suffer from two issues: (1) The localization of video events cannot be both precise and comprehensive. (2) The semantics and temporal context are under-explored. To tackle these issues, we are motivated by the cloze test prevalent in education and propose a novel approach named Visual Cloze Completion (VCC), which conducts VAD by learning to complete "visual cloze tests" (VCTs). Specifically, VCC first localizes each video event and encloses it in a spatio-temporal cube (STC). To achieve both precise and comprehensive localization, appearance and motion are used as complementary cues to mark the object region associated with each event. For each marked region, a normalized patch sequence is extracted from the current and adjacent frames and stacked into an STC. Treating each patch as a visual "word" and the patch sequence of an STC as a visual "sentence", we deliberately erase a certain "word" (patch) to yield a VCT. The VCT is then completed by training DNNs to infer the erased patch and its optical flow via video semantics. Meanwhile, VCC fully exploits temporal context by alternately erasing each patch in the temporal context and creating multiple VCTs. Furthermore, we propose localization-level, event-level, model-level and decision-level solutions to enhance VCC, which further exploit VCC's potential and produce significant performance gains. Extensive experiments demonstrate that VCC achieves state-of-the-art VAD performance. Our code and results are available at https://github.com/yuguangnudt/VEC_VAD/tree/VCC.
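As an illustration of how multiple VCTs can be derived from one STC, the following is a minimal sketch and not the authors' implementation. It assumes an STC is stored as a NumPy array of shape (T, H, W, C); the function name `make_vcts` and the array layout are hypothetical, and the optical-flow target mentioned above is omitted for brevity.

```python
import numpy as np

def make_vcts(stc):
    """Build one visual cloze test (VCT) per patch position of an STC.

    stc: array of shape (T, H, W, C), a sequence of T normalized patches
         (layout assumed for this sketch).
    Returns a list of (erased_index, context, target) tuples.
    """
    stc = np.asarray(stc)
    vcts = []
    for i in range(stc.shape[0]):
        # The incomplete "sentence": the remaining T-1 patches, in order.
        context = np.delete(stc, i, axis=0)
        # The erased "word": the patch a model would be trained to infer.
        target = stc[i]
        vcts.append((i, context, target))
    return vcts

# Usage: an STC of 5 patches yields 5 VCTs, each with 4 context patches.
stc = np.random.rand(5, 32, 32, 3)
for idx, context, target in make_vcts(stc):
    print(idx, context.shape, target.shape)
```

Erasing each position in turn, rather than only the current frame's patch, is what lets every patch in the temporal context serve once as the prediction target.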