This paper introduces Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than clip-by-clip. In online processing tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames that appear in overlapping clips. We show that Continual 3D CNNs can reuse preexisting 3D-CNN weights to reduce the per-prediction floating point operations (FLOPs) in proportion to the temporal receptive field while retaining similar memory requirements and accuracy. This is validated with multiple models on the Kinetics-400 and Charades datasets with remarkable results: Continual X3D models attain state-of-the-art complexity/accuracy trade-offs on Kinetics-400 with 12.1-15.3x reductions in FLOPs and 2.3-3.8% improvements in accuracy compared to regular X3D models, while reducing peak memory consumption by up to 48%. Moreover, we investigate the transient response of Co3D CNNs at start-up and perform an extensive benchmark of on-hardware processing speed and accuracy for publicly available 3D CNNs.
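To make the reuse of clip-based weights concrete, here is a minimal PyTorch sketch of the core idea for a single layer. The class name `ContinualConv3d` and the method `forward_frame` are illustrative inventions, not the paper's API: the layer caches the last k_t - 1 input frames and applies the unchanged pretrained 3D kernel once per new frame, so frames shared by overlapping clips are never re-convolved.

```python
from typing import Optional

import torch
from torch import nn


class ContinualConv3d(nn.Module):
    """Illustrative sketch of a continual 3D convolution (not the authors'
    implementation). Reuses the weights of a regular nn.Conv3d, but consumes
    one frame at a time, caching prior frames so overlapping clips are never
    re-convolved."""

    def __init__(self, conv: nn.Conv3d):
        super().__init__()
        assert conv.stride[0] == 1 and conv.padding[0] == 0, \
            "sketch assumes temporal stride 1 and no temporal padding"
        self.conv = conv
        self.k_t = conv.kernel_size[0]  # temporal kernel size of the layer
        self.frames: Optional[torch.Tensor] = None  # cached input window

    def forward_frame(self, frame: torch.Tensor) -> Optional[torch.Tensor]:
        # frame: (batch, in_channels, height, width)
        frame = frame.unsqueeze(2)  # add a singleton time dimension
        self.frames = frame if self.frames is None else \
            torch.cat([self.frames, frame], dim=2)[:, :, -self.k_t:]
        if self.frames.size(2) < self.k_t:
            return None  # transient start-up: temporal window not yet filled
        # One kernel application per new frame; pretrained weights unchanged.
        return self.conv(self.frames).squeeze(2)


# Usage: wrap a pretrained 3D convolution and feed frames one at a time.
conv = nn.Conv3d(3, 8, kernel_size=(3, 3, 3), padding=(0, 1, 1))
co_conv = ContinualConv3d(conv)
for t in range(10):
    out = co_conv.forward_frame(torch.randn(1, 3, 32, 32))
    # out is None for the first k_t - 1 frames, then (1, 8, 32, 32)
```

After the start-up transient, each per-frame output equals the corresponding output frame of the regular Conv3d applied to the full clip, but the per-frame cost is a single kernel application rather than re-convolving every overlapping clip. A complete Co3D CNN would also require streaming-aware handling of temporal pooling and normalization layers, which this single-layer sketch omits.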