Intuition might suggest that motion and dynamic information are key to video-based action recognition. In contrast, there is evidence that state-of-the-art deep-learning video understanding architectures are biased toward static information available in single frames. Presently, a methodology and corresponding dataset to isolate the effects of dynamic information in video are missing. Their absence makes it difficult to understand how well contemporary architectures capitalize on dynamic vs. static information. We respond with a novel Appearance Free Dataset (AFD) for action recognition. AFD is devoid of static information relevant to action recognition in a single frame; modeling of the dynamics is necessary for solving the task, as the action is only apparent through consideration of the temporal dimension. We evaluated 11 contemporary action recognition architectures on AFD as well as its related RGB videos. Our results show a notable decrease in performance for all architectures on AFD compared to RGB. We also conducted a complementary study with humans, which shows that their recognition accuracy on AFD and RGB is very similar and much better than that of the evaluated architectures on AFD. Our results motivate a novel architecture that revives explicit recovery of optical flow, within a contemporary design, for best performance on AFD and RGB.