In recent years, anomalous event detection in crowd scenes has attracted much attention from researchers because of its importance to public safety. Existing methods usually exploit only visual information to determine whether an abnormal event has occurred, since public places are generally equipped with visual sensors alone. However, when an abnormal event occurs in a crowd, sound can carry discriminative information that helps a crowd analysis system decide whether an abnormality is present. Compared with visual information, which is easily occluded, audio signals have a certain degree of penetration. This paper therefore exploits multi-modal learning to model audio and visual signals simultaneously. Specifically, we design a two-branch network to model the different types of information. The first branch is a typical 3D CNN that extracts temporal appearance features from video clips. The second is an audio CNN that encodes the Log Mel-Spectrogram of the audio signal. Finally, by fusing these features, the network produces more accurate predictions. We conduct experiments on the SHADE dataset, a synthetic audio-visual dataset of surveillance scenes, and find that introducing audio signals effectively improves the performance of anomalous event detection and outperforms other state-of-the-art methods. We will release the code and pre-trained models as soon as possible.
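For illustration, the following is a minimal PyTorch sketch of the two-branch design described above: a 3D CNN over video clips, a 2D CNN over log mel-spectrograms, and feature fusion before prediction. All class names (VideoBranch, AudioBranch, AVFusionNet), layer sizes, and the concatenation-based fusion are assumptions for demonstration; the abstract does not specify the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """3D CNN that extracts temporal appearance features from a video clip."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global pooling over (T, H, W)
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        return self.proj(self.net(clip).flatten(1))

class AudioBranch(nn.Module):
    """2D CNN that encodes a log mel-spectrogram of the audio track."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over (mel, time)
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, spec):                  # spec: (B, 1, n_mels, frames)
        return self.proj(self.net(spec).flatten(1))

class AVFusionNet(nn.Module):
    """Fuses audio and visual features and predicts an anomaly score."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.video = VideoBranch(feat_dim)
        self.audio = AudioBranch(feat_dim)
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, clip, spec):
        # Concatenation fusion (an assumed choice) of the two modalities.
        fused = torch.cat([self.video(clip), self.audio(spec)], dim=1)
        return self.head(fused)               # higher score = more anomalous

# Usage: a batch of two 16-frame RGB clips paired with 64-bin mel-spectrograms.
model = AVFusionNet()
score = model(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 1, 64, 128))
print(score.shape)                            # torch.Size([2, 1])
```

The late-fusion layout mirrors the abstract's description: each modality is encoded independently and only the pooled feature vectors are combined, so occluded visual input can still be compensated by the audio branch.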