Learning light-weight yet expressive deep networks for both image synthesis and image recognition remains a challenging problem. Inspired by the recent observation that it is data specificity that makes the multi-head self-attention (MHSA) in the Transformer model so powerful, this paper proposes to extend the widely adopted light-weight Squeeze-Excitation (SE) module to be spatially adaptive, reinforcing its data specificity as a convolutional alternative to MHSA while retaining the efficiency of SE and the inductive bias of convolution. It presents two designs of spatially-adaptive squeeze-excitation (SASE) modules, for image synthesis and image recognition respectively. For image synthesis, the proposed SASE is tested on both low-shot and one-shot learning tasks and shows better performance than prior art. For image recognition, the proposed SASE is used as a drop-in replacement for convolution layers in ResNets: it achieves much better accuracy than the vanilla ResNets, and slightly better accuracy than MHSA counterparts such as the Swin Transformer and the Pyramid Transformer on the ImageNet-1000 dataset, with significantly smaller models.
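To make the core idea concrete, the following is a minimal NumPy sketch contrasting the vanilla SE module with one plausible spatially-adaptive variant. This is an illustrative assumption of what "spatially adaptive" can mean (a per-location excitation gate instead of a single global one), not the paper's exact SASE designs; the weight shapes and bottleneck ratio `r` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    # Vanilla Squeeze-Excitation on a feature map x of shape (C, H, W).
    # Squeeze: global average pool over space -> one descriptor per channel.
    z = x.mean(axis=(1, 2))                          # (C,)
    # Excite: bottleneck MLP + sigmoid gate, shared by all locations.
    g = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))        # (C,)
    return x * g[:, None, None]                      # same gate at every (h, w)

def sase_block(x, w1, w2):
    # Illustrative spatially-adaptive variant: the excitation is computed
    # per location from the local channel vector, so the gate varies over
    # (H, W) instead of being spatially uniform (this is the assumption).
    C, H, W = x.shape
    z = x.reshape(C, H * W)                          # one channel vector per pixel
    g = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))        # (C, H*W), per-pixel gates
    return x * g.reshape(C, H, W)

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                              # r: bottleneck reduction ratio
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1          # squeeze weights
w2 = rng.standard_normal((C, C // r)) * 0.1          # excite weights
y_se = se_block(x, w1, w2)
y_sase = sase_block(x, w1, w2)
print(y_se.shape, y_sase.shape)
```

Both blocks keep the SE-style bottleneck (and hence its efficiency); the only change in the spatial variant is where the squeeze statistic comes from, which is what lets the gate become data-specific per location.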