This paper introduces Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than clip-by-clip. In online processing tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames that appear in multiple clips. While yielding an order of magnitude in computational savings, Co3D CNNs have memory requirements comparable with those of corresponding regular 3D CNNs and are less affected by changes in the size of the temporal receptive field. We show that Continual 3D CNNs initialised on the weights from preexisting state-of-the-art video recognition models reduce the floating point operations for frame-wise computations by 10.0-12.4x while improving accuracy on Kinetics-400 by 2.3-3.8%. Moreover, we investigate the transient start-up response of Co3D CNNs and perform an extensive benchmark of online processing speed as well as accuracy for publicly available state-of-the-art 3D CNNs on modern hardware.
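To make the recomputation-avoidance idea concrete, below is a minimal PyTorch sketch of a single continual 3D convolution layer. It is an illustrative reconstruction under stated assumptions, not the authors' implementation or API: the temporal dimension of the 3D kernel is unrolled into per-frame 2D convolutions, and partial sums are cached so that each incoming frame is convolved exactly once, rather than once per overlapping clip. The class name `ContinualConv3d`, the method `forward_frame`, and the buffering scheme are hypothetical; a temporal kernel size of at least 2 is assumed.

```python
import torch
import torch.nn.functional as F


class ContinualConv3d(torch.nn.Module):
    """Illustrative sketch (not the authors' code) of a continual 3D conv:
    the temporal kernel is applied as per-frame 2D convolutions whose partial
    results are cached, so each frame is convolved exactly once even when
    clips overlap. Assumes temporal kernel size >= 2.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size=(3, 3, 3)):
        super().__init__()
        self.conv = torch.nn.Conv3d(
            in_channels, out_channels, kernel_size,
            padding=(0, kernel_size[1] // 2, kernel_size[2] // 2),
        )
        self.t = kernel_size[0]
        self.partials = None  # cached partial sums for the next t-1 outputs

    def forward_frame(self, x: torch.Tensor) -> torch.Tensor:
        # x: one incoming frame of shape (N, C, H, W)
        w, b = self.conv.weight, self.conv.bias
        pad = (self.conv.padding[1], self.conv.padding[2])
        # Convolve the new frame with each temporal slice of the 3D kernel.
        contribs = [F.conv2d(x, w[:, :, t], padding=pad) for t in range(self.t)]
        if self.partials is None:
            # Transient start-up: buffer begins at zero, so the first t-1
            # outputs differ from a fully warmed-up regular 3D conv.
            self.partials = [torch.zeros_like(contribs[0]) for _ in range(self.t - 1)]
        # The oldest partial sum is completed by this frame's last kernel slice.
        y = self.partials[0] + contribs[-1] + b.view(1, -1, 1, 1)
        # Shift the buffer and fold in this frame's remaining contributions,
        # which belong to the next t-1 future outputs.
        self.partials = [
            (self.partials[i + 1] if i + 1 < self.t - 1 else 0)
            + contribs[self.t - 2 - i]
            for i in range(self.t - 1)
        ]
        return y
```

After the first t-1 frames (the transient start-up response analysed in the paper), each call to `forward_frame` yields the same result as a regular `Conv3d` sliding over the most recent t frames, but the per-frame cost is a single set of 2D convolutions plus buffer additions, which is where the order-of-magnitude reduction in frame-wise floating point operations comes from.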