Recent methods for video action recognition have achieved outstanding performance on existing benchmarks. However, they tend to leverage context, such as scenes or objects, instead of focusing on understanding the human action itself. For instance, a tennis court leads to the prediction "playing tennis" irrespective of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best examples of out-of-context actions are mimes, which people can typically recognize despite the absence of relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in such an absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting their lack of true understanding of human actions, and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that a shallow neural network with a single temporal convolution, applied over body pose features transferred to the action recognition problem, performs surprisingly well compared to 3D action recognition methods.
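To make the pose-based model concrete, the following is a minimal sketch, not the authors' implementation, of a shallow classifier that applies a single temporal convolution over per-frame body pose features, as described in the abstract. The feature dimension, kernel size, and pooling choice are assumptions made purely for illustration; the pose feature extractor is assumed to be a separate, pretrained module.

```python
# Hypothetical sketch (not the paper's code): a shallow network with a
# single temporal convolution over a sequence of per-frame pose features.
import torch
import torch.nn as nn

class ShallowPoseClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=50, kernel_size=5):
        super().__init__()
        # Single temporal convolution across the frame axis.
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size,
                                       padding=kernel_size // 2)
        self.relu = nn.ReLU()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, pose_features):
        # pose_features: (batch, time, feat_dim), e.g. frame-level features
        # produced by a pretrained body pose estimator (an assumption here).
        x = pose_features.transpose(1, 2)     # -> (batch, feat_dim, time)
        x = self.relu(self.temporal_conv(x))  # the single temporal conv
        x = x.mean(dim=2)                     # temporal average pooling
        return self.classifier(x)             # per-video class scores

# Usage: scores = ShallowPoseClassifier()(torch.randn(2, 16, 512))
```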