Currently, spatiotemporal features are embraced by most deep learning approaches for human action detection in videos; however, these approaches neglect important features in the frequency domain. In this work, we propose an end-to-end network, named TFNet, that considers time and frequency features simultaneously. TFNet has two branches: a time branch, formed of a three-dimensional convolutional neural network (3D-CNN), which takes the image sequence as input to extract time features; and a frequency branch, which extracts frequency features from DCT coefficients through a two-dimensional convolutional neural network (2D-CNN). Finally, to obtain the action patterns, these two features are deeply fused under an attention mechanism. Experimental results on the JHMDB51-21 and UCF101-24 datasets demonstrate that our approach achieves remarkable frame-mAP performance.
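The attention-based fusion of the two branch outputs can be sketched as follows. This is a minimal illustration only: the abstract does not specify the feature dimensions, the form of the attention, or how the 3D-CNN and 2D-CNN are built, so the 256-dimensional embeddings, the random stand-in branch outputs, and the softmax gating weights below are all assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for the two branch outputs (hypothetical 256-d embeddings):
# f_time  -- features a 3D-CNN would extract from the frame sequence
# f_freq  -- features a 2D-CNN would extract from DCT coefficients
f_time = rng.standard_normal((1, 256))
f_freq = rng.standard_normal((1, 256))

# One simple attention-style fusion: compute a per-branch gate from the
# concatenated features, normalize with softmax, and take a weighted sum.
concat = np.concatenate([f_time, f_freq], axis=-1)   # (1, 512)
W = rng.standard_normal((512, 2))                    # hypothetical gating weights
gate = softmax(concat @ W)                           # (1, 2), sums to 1
fused = gate[0, 0] * f_time + gate[0, 1] * f_freq    # (1, 256) fused feature

print(fused.shape)
```

In a trained network the gating weights `W` would be learned jointly with both branches end-to-end; here they are random purely to make the fusion step concrete.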