Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART blocks, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, a SMART block decouples spatiotemporal learning into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented as a linear combination of pixels or filter responses within each frame, while the relation branch is designed around multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51. The results demonstrate that SMART blocks yield an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training settings, ARTNets achieve performance superior to the existing state-of-the-art methods on all three datasets.
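To make the two-branch decomposition concrete, below is a minimal PyTorch sketch of a SMART-style block: a per-frame spatial convolution for the appearance branch, and a 3D convolution whose responses are squared and then pooled across channels as one plausible realization of the multiplicative cross-frame interactions. The class name `SMARTBlock`, the channel split, and all hyperparameters are illustrative assumptions, not the authors' exact published configuration.

```python
import torch
import torch.nn as nn


class SMARTBlock(nn.Module):
    """Sketch of a SMART-style block: an appearance branch (spatial modeling)
    and a relation branch (temporal modeling), concatenated along channels.
    Hyperparameters and layer choices are illustrative assumptions."""

    def __init__(self, in_channels, out_channels, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        half = out_channels // 2
        # Appearance branch: linear combination of pixels / filter responses
        # within each frame, i.e. a per-frame (1 x k x k) convolution.
        self.appearance = nn.Conv3d(
            in_channels, half,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2))
        # Relation branch: a (t x k x k) 3D convolution followed by squaring,
        # so responses contain products of inputs from different frames
        # (an assumed way to realize multiplicative interactions), then
        # cross-channel pooling via a 1x1x1 convolution.
        self.relation_conv = nn.Conv3d(
            in_channels, half * 2,
            kernel_size=(temporal_kernel, spatial_kernel, spatial_kernel),
            padding=(temporal_kernel // 2, spatial_kernel // 2, spatial_kernel // 2))
        self.cross_channel_pool = nn.Conv3d(half * 2, half, kernel_size=1)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W) clip of RGB frames
        app = self.appearance(x)
        rel = self.cross_channel_pool(self.relation_conv(x) ** 2)
        return self.relu(self.bn(torch.cat([app, rel], dim=1)))


if __name__ == "__main__":
    block = SMARTBlock(3, 64)
    clip = torch.randn(2, 3, 16, 112, 112)  # batch of 16-frame RGB clips
    print(block(clip).shape)  # torch.Size([2, 64, 16, 112, 112])
```

Stacking such blocks in place of plain 3D convolutions is what, per the abstract, gives ARTNets their improvement for spatiotemporal feature learning; the squaring nonlinearity here is only one way to inject the cross-frame multiplicative terms the relation branch is meant to capture.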