In this work, we propose a generally applicable transformation unit for visual recognition with deep convolutional neural networks. This transformation explicitly models channel relationships with interpretable control variables. These variables determine whether neurons compete or cooperate, and they are jointly optimized with the convolutional weights towards more accurate recognition. In Squeeze-and-Excitation (SE) Networks, the channel relationships are implicitly learned by fully connected layers, and the SE block is integrated at the block level. We instead introduce a channel normalization layer to reduce the number of parameters and the computational complexity. This lightweight layer incorporates a simple ℓ2 normalization, which makes our transformation unit applicable at the operator level with little increase in parameters. Extensive experiments demonstrate the effectiveness of our unit by clear margins on many vision tasks, namely image classification on ImageNet, object detection and instance segmentation on COCO, and video classification on Kinetics.
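The abstract does not give the exact formulation, so the following is only a minimal NumPy sketch of what a gating unit built on ℓ2 channel normalization could look like. The function name `gated_channel_transform`, the per-channel variables `alpha`, `gamma`, and `beta`, the `1 + tanh` gate, and the `eps` constant are all assumptions made for illustration, not details confirmed by the text above.

```python
import numpy as np

def gated_channel_transform(x, alpha, gamma, beta, eps=1e-5):
    """Hypothetical channel-gating sketch.

    x:                 feature map of shape (N, C, H, W)
    alpha, gamma, beta: per-channel control variables of shape (1, C, 1, 1)
    """
    # Per-channel embedding: l2 norm over the spatial positions, scaled by alpha.
    embedding = alpha * np.sqrt(np.sum(x ** 2, axis=(2, 3), keepdims=True) + eps)
    # Channel normalization: divide each embedding by the root mean square
    # of the embeddings across the channel axis.
    rms = np.sqrt(np.mean(embedding ** 2, axis=1, keepdims=True) + eps)
    normalized = embedding / rms
    # Gate (assumed form): reduces to the identity when gamma = beta = 0.
    gate = 1.0 + np.tanh(gamma * normalized + beta)
    return x * gate

# Usage: with gamma = beta = 0 the unit passes the input through unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))
channels = x.shape[1]
alpha = np.ones((1, channels, 1, 1))
zeros = np.zeros((1, channels, 1, 1))
y = gated_channel_transform(x, alpha, zeros, zeros)
```

In this sketch the layer holds three scalars per channel and no fully connected weights, which is one way to read the abstract's claim of a lightweight unit. The sign of `gamma` decides whether channels with a larger relative norm are amplified or attenuated, which is a possible reading of the competition and cooperation behavior described above.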