We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges; (ii) gated aggregation to selectively gather contexts for each query token based on its content; and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets of tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at $224^2$ resolution, they attain 86.5% and 87.3% top-1 accuracy when finetuned at resolutions $224^2$ and $384^2$, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a $1\times$ schedule outperforms its Swin counterpart by 2.1 points and already surpasses Swin trained with a $3\times$ schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single scale outperforms Swin by 2.4 points and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we achieve 58.5 mIoU for ADE20K semantic segmentation and 57.9 PQ for COCO panoptic segmentation. Using huge FocalNet and DINO, we achieve 64.2 and 64.3 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models such as SwinV2-G and BEIT-3. Code is available at \url{https://github.com/microsoft/FocalNet}.
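The three components above can be illustrated with a minimal NumPy sketch on a 1D token sequence. This is not the paper's implementation: the depth-wise convolution is replaced by a fixed averaging kernel, all projections use random weights, and the function names, kernel sizes, and the softmax over focal levels are illustrative assumptions.

```python
import numpy as np

def dwconv1d(x, k):
    """Depth-wise 1D convolution with 'same' padding on a (T, C) sequence.
    Uses a fixed averaging kernel per channel (a stand-in for learned filters)."""
    T, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def focal_modulation(x, num_levels=2, rng=np.random.default_rng(0)):
    """Sketch of focal modulation on a (T, C) token sequence."""
    T, C = x.shape
    # (i) hierarchical contextualization: stacked depth-wise convs with
    #     growing kernels, plus one global-average level
    z, contexts = x, []
    for l in range(num_levels):
        z = gelu(dwconv1d(z, k=3 + 2 * l))
        contexts.append(z)
    contexts.append(np.broadcast_to(z.mean(axis=0, keepdims=True), z.shape))
    # (ii) gated aggregation: per-token gates over the L+1 context levels
    Wg = rng.normal(size=(C, num_levels + 1))   # gating projection (random here)
    gates = x @ Wg                              # (T, L+1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
    modulator = sum(g[:, None] * c for g, c in zip(gates.T, contexts))
    # (iii) element-wise modulation of the query projection
    Wq = rng.normal(size=(C, C))                # query projection (random here)
    return (x @ Wq) * modulator                 # (T, C)

tokens = np.random.default_rng(1).normal(size=(8, 4))
out = focal_modulation(tokens)
print(out.shape)  # (8, 4)
```

Note that, unlike self-attention, each token's output depends on its surroundings only through the shared convolutional context maps and its own gates, so the interaction cost grows linearly rather than quadratically in sequence length.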