Container: Context Aggregation Network

Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation that can exploit long-range interactions à la Transformers while still exploiting the inductive bias of the local convolution operation, which leads to the faster convergence speeds often seen in CNNs. In contrast to Transformer-based methods, which do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks such as DETR, RetinaNet and Mask R-CNN to obtain an impressive detection mAP of 38.9, 43.8, 45.1 and mask mAP of 41.3, providing large improvements of 6.6, 7.3, 6.9 and 6.6 points respectively, compared to a ResNet-50 backbone with comparable compute and parameter size. Our method also achieves promising results on self-supervised learning compared to DeiT on the DINO framework.
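The unified view can be illustrated with a small sketch. The abstract does not give a formula, so everything below is an assumption made for illustration: a single head over a 1-D sequence of tokens, with context aggregation written as `Y = A @ V` for an affinity matrix `A`. Under that assumption, a Transformer-style block computes `A` from the input, a convolution-style block uses a fixed local (banded) `A`, and an MLP-Mixer-style block uses a fixed dense `A`. The function names, and the `alpha`/`beta` mixing weights, are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_affinity(X, Wq, Wk):
    """Transformer-style affinity: computed from the input tokens X."""
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)


def local_static_affinity(n, kernel):
    """Convolution-style affinity: a fixed banded matrix that applies the
    same kernel weights at every position (1-D, zero padding)."""
    r = len(kernel) // 2
    A = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            p = i + j - r
            if 0 <= p < n:
                A[i, p] = w
    return A


def mixed_affinity(A_dyn, A_static, alpha, beta):
    """Hypothetical blend of a dynamic and a static affinity matrix."""
    return alpha * A_dyn + beta * A_static


def aggregate(A, V):
    """Aggregate spatial context: each output token is a weighted sum of V."""
    return A @ V


# Illustrative usage with random data (n tokens, d channels).
rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

A_attn = dynamic_affinity(X, Wq, Wk)                   # input-dependent, global
A_conv = local_static_affinity(n, [0.25, 0.5, 0.25])   # fixed, local
A_mlp = rng.normal(size=(n, n))                        # fixed, dense (Mixer-style)

Y = aggregate(mixed_affinity(A_attn, A_conv, alpha=0.5, beta=0.5), X)
print(Y.shape)
```

In this sketch, the only thing that changes between the three families is how `A` is produced; the aggregation step itself is identical. A multi-head version would repeat the same computation over channel groups.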

