Efficient video architectures are key to deploying video recognition systems on devices with limited computing resources. Unfortunately, existing video architectures are often computationally intensive and thus unsuitable for such applications. The recent X3D work presents a new family of efficient video models by expanding a hand-crafted image architecture along multiple axes, such as space, time, width, and depth. Although operating in a conceptually large space, X3D searches one axis at a time and explores only a small set of 30 architectures in total, which does not sufficiently cover the space. This paper bypasses existing 2D architectures and directly searches for 3D architectures in a fine-grained space, where block type, filter number, expansion ratio, and attention block are jointly searched. A probabilistic neural architecture search method is adopted to search efficiently in such a large space. Evaluations on the Kinetics and Something-Something-V2 benchmarks confirm that our AutoX3D models outperform existing ones in accuracy by up to 1.3% under similar FLOPs, and reduce computational cost by up to 1.74× when reaching similar performance.
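The joint search described above can be contrasted with X3D's one-axis-at-a-time expansion with a minimal sketch of probabilistic sampling over a factorized search space. The axis names and candidate values below are illustrative assumptions, not the paper's exact configuration; in a real search, the per-axis distributions would be updated from reward signals rather than left uniform.

```python
import random

# Hypothetical joint search space: every axis is sampled together per candidate,
# instead of sweeping one axis while freezing the others (as in X3D).
SEARCH_SPACE = {
    "block_type": ["2d_conv", "3d_conv", "2plus1d_conv"],
    "filters": [24, 48, 96, 192],
    "expansion_ratio": [2.0, 2.5, 3.0, 4.0],
    "attention": ["none", "se", "temporal"],
}

# One categorical distribution per axis, initialized uniform.
# A probabilistic NAS method would update these weights during search.
probs = {axis: [1.0 / len(v)] * len(v) for axis, v in SEARCH_SPACE.items()}

def sample_architecture(rng=random):
    """Draw one architecture by sampling all axes jointly."""
    return {
        axis: rng.choices(choices, weights=probs[axis])[0]
        for axis, choices in SEARCH_SPACE.items()
    }

arch = sample_architecture()
```

Even this toy space already contains 3 × 4 × 4 × 3 = 144 joint configurations, far more than the 30 architectures a per-axis sweep would visit, which is the motivation for sampling rather than enumerating.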