Convolutional neural networks (CNNs) are ubiquitous in computer vision, with myriad effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continued to employ CNN backbones, the latest networks are end-to-end, CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present the \model (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions \emph{\`a la} Transformers while still exploiting the inductive bias of the local convolution operation, which leads to the faster convergence often seen in CNNs. In contrast to Transformer-based methods, which do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named \modellight, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask R-CNN to obtain detection mAP of 38.9, 43.8 and 45.1 and mask mAP of 41.3, improvements of 6.6, 7.3, 6.9 and 6.6 points respectively over a ResNet-50 backbone with comparable compute and parameter count. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework. Code is released at \url{https://github.com/allenai/container}.
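As a rough illustration of the unified view, the sketch below treats context aggregation as multiplying a token matrix by an affinity matrix, and builds three kinds of affinity: input-dependent (attention-like), banded and input-independent (convolution-like), and dense and input-independent (MLP-Mixer-like). This is one possible reading of the abstract, not the released implementation. It is single-head and 1-D for brevity, and the function names, the mixing weights `alpha` and `beta`, and the random weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(affinity, values):
    """Each output token is an affinity-weighted sum of the value tokens."""
    return affinity @ values

def dynamic_affinity(x, w_q, w_k):
    """Attention-like affinity: computed from the input itself."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def local_static_affinity(n, kernel):
    """Convolution-like affinity: input-independent, non-zero only near the diagonal."""
    a = np.zeros((n, n))
    r = len(kernel) // 2
    for i in range(n):
        for j, w in enumerate(kernel):
            p = i + j - r
            if 0 <= p < n:          # zero padding at the borders
                a[i, p] = w
    return a

rng = np.random.default_rng(0)
n, d = 6, 4                         # hypothetical: 6 tokens, 4 channels
x = rng.normal(size=(n, d))
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
kernel = np.array([0.25, 0.5, 0.25])

a_dyn = dynamic_affinity(x, w_q, w_k)      # Transformer-like
a_loc = local_static_affinity(n, kernel)   # CNN-like
a_dense = rng.normal(size=(n, n))          # MLP-Mixer-like: dense, input-independent

# Assumed mixing of dynamic and static affinities; alpha and beta are hypothetical.
alpha, beta = 0.5, 0.5
y_mixed = aggregate(alpha * a_dyn + beta * a_loc, x)
y_mixer = aggregate(a_dense, x)
```

With the symmetric kernel used here, `aggregate(a_loc, x)` matches a zero-padded 1-D convolution applied to each channel, which is the sense in which a convolution is a special case of affinity-based aggregation.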