Spatiotemporal feature learning in videos is a fundamental and difficult problem in computer vision. This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented as a linear combination of pixels or filter responses within each frame, while the relation branch is designed around multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks yield an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets achieve performance superior to existing state-of-the-art methods on all three datasets.
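To make the two-branch decomposition concrete, below is a minimal PyTorch sketch of a SMART-style block. The appearance branch is a per-frame 2D spatial convolution (a 3D convolution with temporal extent 1, matching the "linear combination within each frame" described above); the relation branch realizes the "multiplicative interactions across frames" by squaring the responses of a 3D convolution and reducing them with cross-channel pooling. The specific kernel sizes, channel split, and pooling scheme here are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SMARTBlock(nn.Module):
    """Sketch of a SMART-style block: an appearance branch for spatial
    modeling and a relation branch for temporal modeling, fused by
    concatenation. Hyperparameters are illustrative assumptions."""

    def __init__(self, in_channels, out_channels, t_kernel=3):
        super().__init__()
        half = out_channels // 2
        # Appearance branch: linear combination of responses within each
        # frame, i.e. a 2D spatial convolution applied frame by frame
        # (expressed as a 3D conv with temporal kernel size 1).
        self.appearance = nn.Sequential(
            nn.Conv3d(in_channels, half, kernel_size=(1, 3, 3),
                      padding=(0, 1, 1)),
            nn.BatchNorm3d(half),
        )
        # Relation branch: multiplicative interactions across frames,
        # realized here (an assumption) as a 3D convolution whose
        # responses are squared and then cross-channel pooled in pairs.
        self.relation_conv = nn.Conv3d(
            in_channels, 2 * half, kernel_size=(t_kernel, 3, 3),
            padding=(t_kernel // 2, 1, 1))
        self.relation_bn = nn.BatchNorm3d(half)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W)
        app = self.appearance(x)
        rel = self.relation_conv(x) ** 2       # multiplicative interaction
        n, c, t, h, w = rel.shape
        # Cross-channel pooling: average each pair of squared responses.
        rel = rel.view(n, c // 2, 2, t, h, w).mean(dim=2)
        rel = self.relation_bn(rel)
        out = torch.cat([app, rel], dim=1)     # fuse the two branches
        return self.relu(out)

# Usage on a batch of two 16-frame RGB clips.
block = SMARTBlock(3, 64)
clip = torch.randn(2, 3, 16, 112, 112)
print(block(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```

Squaring filter responses is one standard way to obtain multiplicative pixel interactions (in the spirit of energy models); an ARTNet would then be built by stacking such blocks in place of plain 3D convolutions.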