We introduce Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip. In online tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames, which appear in overlapping clips. We show that Continual 3D CNNs can reuse preexisting 3D-CNN weights to reduce the per-prediction floating point operations (FLOPs) in proportion to the temporal receptive field while retaining similar memory requirements and accuracy. This is validated with multiple models on Kinetics-400 and Charades with remarkable results: CoX3D models attain state-of-the-art complexity/accuracy trade-offs on Kinetics-400 with 12.1-15.3x reductions of FLOPs and 2.3-3.8% improvements in accuracy compared to regular X3D models while reducing peak memory consumption by up to 48%. Moreover, we investigate the transient response of Co3D CNNs at start-up and perform extensive benchmarks of on-hardware processing characteristics for publicly available 3D CNNs.
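To make the continual formulation concrete, below is a minimal sketch of a frame-wise 3D convolution in PyTorch. It is an illustration under our own assumptions, not the authors' released implementation: the class name ContinualConv3d, the forward_frame method, and the zero-initialised start-up state are hypothetical. A FIFO buffer holds the last kT input frames, so each arriving frame costs one output frame's worth of convolution, whereas clip-based inference re-convolves every frame of each overlapping clip; the zero-initialised buffer also exhibits the kind of transient start-up response investigated in the paper.

import torch

class ContinualConv3d(torch.nn.Module):
    """Hypothetical sketch of a continual 3D convolution.

    A FIFO buffer keeps the last kT input frames, so each new frame
    yields one output frame. Clip-based inference would instead
    re-convolve all frames of every overlapping clip, which is the
    redundancy the continual formulation removes.
    """

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # kernel_size = (kT, kH, kW); the weights can be loaded from a
        # regular, pre-trained nn.Conv3d of the same shape.
        self.conv = torch.nn.Conv3d(in_channels, out_channels, kernel_size)
        self.kT = kernel_size[0]
        self.buffer = None  # (B, C, kT, H, W) FIFO of recent frames

    def forward_frame(self, frame):
        # frame: (B, C, H, W) -> one output frame (B, C_out, H', W')
        frame = frame.unsqueeze(2)  # add a singleton time dimension
        if self.buffer is None:
            # Zero-initialised state; the first kT - 1 outputs form a
            # transient start-up response before steady state is reached.
            self.buffer = torch.zeros_like(frame).repeat(1, 1, self.kT, 1, 1)
        self.buffer = torch.cat([self.buffer[:, :, 1:], frame], dim=2)
        return self.conv(self.buffer).squeeze(2)


if __name__ == "__main__":
    co = ContinualConv3d(3, 8, kernel_size=(3, 3, 3))
    stream = torch.randn(1, 3, 10, 32, 32)  # a 10-frame video stream
    outputs = [co.forward_frame(stream[:, :, t]) for t in range(10)]
    # After the transient (kT - 1 frames), the frame-wise outputs match
    # a regular clip-based Conv3d applied to the same stream:
    clip_out = co.conv(stream)  # (1, 8, 8, 30, 30)
    assert torch.allclose(torch.stack(outputs[2:], dim=2), clip_out, atol=1e-6)

Because the wrapped nn.Conv3d is left unchanged, pre-existing 3D-CNN weights can be reused directly, and the per-layer state grows only with the temporal receptive field, in line with the weight-reuse and memory claims above.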