In this paper, we provide an in-depth analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach, based on layer-wise relevance propagation, to quantify the temporal relationships between frames captured by CNN-based action models. We then conduct comprehensive experiments and in-depth analysis to better understand how temporal modeling is affected by factors such as dataset, network architecture, and input frames. Building on this, we study several important questions for action recognition, which lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance, and that action models tend to capture local temporal information but fewer long-range dependencies. Our code and models will be publicly available.