Modern convolutional neural networks apply the same operations to every pixel in an image, even though not all image regions are equally important. To address this inefficiency, we propose a method that applies convolutions dynamically, conditioned on the input image. We introduce a residual block in which a small gating branch learns which spatial positions should be evaluated. These discrete gating decisions are trained end-to-end using the Gumbel-Softmax trick in combination with a sparsity criterion. Our experiments on CIFAR, ImageNet and MPII show that our method focuses better on the region of interest and achieves higher accuracy than existing methods, at a lower computational complexity. Moreover, we provide an efficient CUDA implementation of our dynamic convolutions using a gather-scatter approach, achieving a significant improvement in inference speed with MobileNetV2 residual blocks. On human pose estimation, a task that is inherently spatially sparse, the processing speed is increased by 60% with no loss in accuracy.
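To make the gating idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-channel feature map stored as nested lists, a pointwise `body` function standing in for the residual branch (a real MobileNetV2 block also contains a depthwise 3x3 convolution that needs neighbouring pixels), and a simple squared-error sparsity term. All function names (`gumbel_binary_gate`, `sparse_residual_block`, `sparsity_loss`) and the exact form of the loss are hypothetical; the abstract only states that a Gumbel-Softmax trick, a sparsity criterion, and a gather-scatter strategy are used.

```python
import math
import random


def gumbel_binary_gate(logit, rng, temperature=1.0, training=True):
    """Return (soft, hard) gate values for one spatial position.

    For a two-class Gumbel-Softmax, the difference of two Gumbel(0, 1)
    samples is Logistic(0, 1) noise, so a single noise term suffices.
    """
    noise = 0.0
    if training:
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        noise = math.log(u) - math.log(1.0 - u)
    z = (logit + noise) / temperature
    soft = 1.0 / (1.0 + math.exp(-z))  # relaxed gate; a framework would backpropagate through this
    hard = 1 if soft > 0.5 else 0      # discrete decision used in the forward pass
    return soft, hard


def sparse_residual_block(x, gate_logits, body, rng=None, training=False):
    """Evaluate `body` only where the gate is active, then add the residual.

    x and gate_logits are H x W nested lists (one channel, for brevity).
    """
    if rng is None:
        rng = random.Random(0)
    h, w = len(x), len(x[0])
    mask = [[gumbel_binary_gate(gate_logits[i][j], rng, training=training)[1]
             for j in range(w)] for i in range(h)]
    # Gather: collect only the active positions into a dense list.
    active = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]
    gathered = [x[i][j] for i, j in active]
    # Compute: the residual branch runs on the gathered values only.
    computed = [body(v) for v in gathered]
    # Scatter: write results back; inactive positions keep the identity path.
    out = [row[:] for row in x]
    for (i, j), r in zip(active, computed):
        out[i][j] += r
    return out, mask


def sparsity_loss(mask, target):
    """Assumed criterion: squared gap between the active fraction and a target."""
    active_fraction = sum(map(sum, mask)) / (len(mask) * len(mask[0]))
    return (active_fraction - target) ** 2


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    logits = [[2.0, -2.0], [-1.0, 3.0]]
    out, mask = sparse_residual_block(x, logits, body=lambda v: 10.0 * v)
    print(mask, out, sparsity_loss(mask, target=0.5))
```

At inference time (`training=False`) no noise is added, so the gate reduces to thresholding the logit at zero and the set of evaluated positions is deterministic. During training, a framework with automatic differentiation would typically use the hard decision in the forward pass and the gradient of the soft value in the backward pass; that straight-through step is not shown here.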