In natural images, information is conveyed at different frequencies: higher frequencies usually encode fine details, while lower frequencies usually encode global structures. Similarly, the output feature maps of a convolution layer can be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation that stores and processes the feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement for (vanilla) convolutions without any adjustments to the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy, such as group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.
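To make the idea concrete, the following is a minimal pure-Python sketch of an octave-style convolution, not the authors' implementation. It assumes details that the abstract does not state: the low-frequency group is kept at exactly half the spatial resolution, information is exchanged between the two groups through 2x2 average pooling and nearest-neighbor upsampling, and all kernels are 1x1 for brevity. The names `avg_pool2`, `upsample2`, `conv1x1`, `add`, and `oct_conv1x1` are hypothetical helpers introduced only for illustration.

```python
# Sketch of an octave-style convolution with 1x1 kernels (illustrative assumptions only).
# Feature maps are nested lists indexed [channel][row][col].

def avg_pool2(x):
    """Halve spatial resolution with 2x2 average pooling (even height and width assumed)."""
    return [
        [
            [(ch[2 * i][2 * j] + ch[2 * i][2 * j + 1]
              + ch[2 * i + 1][2 * j] + ch[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(len(ch[0]) // 2)]
            for i in range(len(ch) // 2)
        ]
        for ch in x
    ]

def upsample2(x):
    """Double spatial resolution with nearest-neighbor upsampling."""
    return [
        [[ch[i // 2][j // 2] for j in range(2 * len(ch[0]))]
         for i in range(2 * len(ch))]
        for ch in x
    ]

def conv1x1(x, w):
    """Pointwise convolution: w[o][c] mixes input channel c into output channel o."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(w[o][c] * x[c][i][j] for c in range(len(x))) for j in range(width)]
         for i in range(height)]
        for o in range(len(w))
    ]

def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def oct_conv1x1(x_high, x_low, w_hh, w_hl, w_lh, w_ll):
    """Update both groups; w_hl maps high to low, w_lh maps low to high."""
    # High-frequency output: same-resolution path plus upsampled low-frequency path.
    y_high = add(conv1x1(x_high, w_hh), upsample2(conv1x1(x_low, w_lh)))
    # Low-frequency output: same-resolution path plus pooled high-frequency path.
    y_low = add(conv1x1(x_low, w_ll), conv1x1(avg_pool2(x_high), w_hl))
    return y_high, y_low

# One high-frequency channel at 4x4 and one low-frequency channel at 2x2.
x_high = [[[1.0] * 4 for _ in range(4)]]
x_low = [[[2.0] * 2 for _ in range(2)]]
identity = [[1.0]]
y_high, y_low = oct_conv1x1(x_high, x_low, identity, identity, identity, identity)

# Values stored: 16 + 4 = 20, versus 32 if both channels were kept at 4x4.
stored = sum(len(ch) * len(ch[0]) for ch in x_high + x_low)
```

In this toy setting the low-frequency channel holds a quarter as many values as a full-resolution channel, which is where the memory and computation savings come from under the stated assumptions.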