We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to deploy on mobile devices. We propose a three-step approach to improve computational efficiency while substantially reducing the peak memory usage of 3D CNNs. First, we design a video network search space and employ neural architecture search to generate efficient and diverse 3D CNN architectures. Second, we introduce the stream buffer technique, which decouples memory from video clip duration, allowing 3D CNNs to embed arbitrary-length streaming video sequences for both training and inference with a small constant memory footprint. Third, we propose a simple ensembling technique to further improve accuracy without sacrificing efficiency. These three progressive techniques allow MoViNets to achieve state-of-the-art accuracy and efficiency on the Kinetics, Moments in Time, and Charades video action recognition datasets. For instance, MoViNet-A5-Stream achieves the same accuracy as X3D-XL on Kinetics 600 while requiring 80% fewer FLOPs and 65% less memory. Code will be made available at https://github.com/tensorflow/models/tree/master/official/vision.
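To illustrate the stream buffer idea described above, here is a minimal NumPy sketch (not the paper's TensorFlow implementation): a causal temporal convolution processes one subclip at a time while a small buffer caches the last K-1 feature frames at the subclip boundary, so the output over a long stream matches full-clip processing with only one subclip resident in memory. The names `StreamBuffer` and `causal_temporal_conv` are illustrative, not from the released code.

```python
import numpy as np

def causal_temporal_conv(x, kernel):
    """1D convolution over the time axis with valid padding.
    x: (T, C) features, kernel: (K, C) weights -> (T - K + 1, C)."""
    K = kernel.shape[0]
    return np.stack([(x[t:t + K] * kernel).sum(axis=0)
                     for t in range(x.shape[0] - K + 1)])

class StreamBuffer:
    """Caches the trailing K-1 feature frames between subclips so the
    causal temporal conv sees the same receptive field as it would on
    the full clip, keeping peak memory constant in stream length."""
    def __init__(self, kernel):
        self.kernel = kernel      # (K, C) temporal filter
        self.state = None         # (K-1, C) cached boundary features

    def __call__(self, subclip):
        K = self.kernel.shape[0]
        if self.state is None:
            # Causal (left) zero padding for the very first subclip.
            self.state = np.zeros((K - 1, subclip.shape[1]))
        x = np.concatenate([self.state, subclip], axis=0)
        self.state = x[-(K - 1):]  # carry the boundary to the next call
        return causal_temporal_conv(x, self.kernel)

# Streaming over subclips reproduces full-clip causal convolution:
rng = np.random.default_rng(0)
clip, kernel = rng.normal(size=(16, 8)), rng.normal(size=(3, 8))
buf = StreamBuffer(kernel)
streamed = np.concatenate([buf(clip[:8]), buf(clip[8:])], axis=0)
full = causal_temporal_conv(
    np.concatenate([np.zeros((2, 8)), clip], axis=0), kernel)
assert np.allclose(streamed, full)
```

In this sketch, memory scales with the subclip length plus the fixed K-1 buffer frames rather than the total stream duration, which is the property the abstract attributes to stream buffers.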