Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent work on two-stream video models, such as the SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the in-between fusion blocks is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and, when computational resources are limited and the exploration is insufficient, often ends up with sub-optimal architectures. In this work, we present a pragmatic neural architecture search approach that can efficiently search for two-stream video models in giant spaces. We design a multivariate search space with 6 search variables to capture a wide variety of choices in designing two-stream models. Furthermore, we propose a progressive search procedure that searches for the architecture of the individual streams, the fusion blocks, and the attention blocks one after the other. We demonstrate that two-stream models with significantly better performance can be automatically discovered in our design space. Our searched two-stream models, namely Auto-TSNet, consistently outperform other models on standard benchmarks. On Kinetics, compared with the SlowFast model, our Auto-TSNet-L model reduces FLOPS by nearly 11 times while achieving the same accuracy of 78.9%. On Something-Something-V2, Auto-TSNet-M improves accuracy by at least 2% over other methods that use less than 50 GFLOPS per video.
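To make the progressive search procedure concrete, below is a minimal Python sketch of stage-by-stage search under toy assumptions: the search spaces, the `evaluate` stand-in, and all variable names are hypothetical illustrations, not the paper's actual search variables or training pipeline. The point it shows is that each stage is searched with the choices from earlier stages frozen.

```python
import itertools
import random

# Toy per-stage search spaces; the variables here are illustrative only
# and are NOT the paper's 6 actual search variables.
STREAM_SPACE = {"depth": [18, 50], "frame_rate": [4, 8], "width": [0.5, 1.0]}
FUSION_SPACE = {"fusion_type": ["sum", "concat"], "fusion_stages": [2, 3]}
ATTENTION_SPACE = {"attention": ["none", "channel", "temporal"]}

def evaluate(config):
    # Stand-in for training and validating a candidate two-stream model;
    # a real NAS system would return validation accuracy here.
    random.seed(str(sorted(config.items())))
    return random.random()

def search_stage(space, fixed):
    """Score every choice in one stage's space, keeping earlier stages frozen
    inside `fixed`, and return the best combined configuration."""
    keys = list(space)
    best, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        candidate = dict(fixed, **dict(zip(keys, values)))
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

# Progressive search: individual streams -> fusion blocks -> attention blocks.
config = {}
for space in (STREAM_SPACE, FUSION_SPACE, ATTENTION_SPACE):
    config = search_stage(space, config)
print(config)
```

In this sketch each stage is an exhaustive sweep for clarity; an actual system would substitute a more efficient search strategy per stage, but the freeze-then-search ordering is the same.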