Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks, emerging as alternatives to classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating an inductive bias such as spatial locality benefits the learned representations. However, most prior works focused solely on the locations of patches, overlooking the scene structure of images. Thus, we aim to further guide the interaction of patches using object information. Specifically, we propose OAMixer (object-aware mixing layer), which calibrates the patch mixing layers of patch-based models based on object labels. Here, we obtain the object labels in an unsupervised or weakly-supervised manner, i.e., no additional human annotation cost is required. Using the object labels, OAMixer computes a reweighting mask with a learnable scale parameter that intensifies the interaction of patches containing similar objects, and applies the mask to the patch mixing layers. By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that OAMixer benefits various downstream tasks, including large-scale classification, self-supervised learning, and multi-object recognition, verifying the generic applicability of OAMixer.
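To make the mechanism concrete, the following is a minimal PyTorch sketch of one plausible reading of the object-aware reweighting mask; the module name `OAMixerSketch`, the soft object-label input `obj`, and the exponential form of the mask are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class OAMixerSketch(nn.Module):
    """Hypothetical sketch of an object-aware reweighting mask.

    Assumes soft object labels `obj` of shape (B, N, K): for each of
    the N patches, a distribution over K (pseudo-)object classes
    obtained in an unsupervised or weakly-supervised manner.
    """

    def __init__(self):
        super().__init__()
        # Learnable scale controlling how strongly same-object patches
        # interact; initialized to 0 so the mask starts as the identity.
        self.scale = nn.Parameter(torch.zeros(1))

    def reweighting_mask(self, obj):
        # Pairwise object similarity between patches: (B, N, N).
        sim = torch.bmm(obj, obj.transpose(1, 2))
        # exp(scale * sim) amplifies pairs with similar object labels.
        return torch.exp(self.scale * sim)

    def forward(self, mixing, obj):
        """Calibrate a patch-mixing matrix `mixing` of shape (B, N, N),
        e.g. attention probabilities, and renormalize each row."""
        mixing = mixing * self.reweighting_mask(obj)
        return mixing / mixing.sum(dim=-1, keepdim=True)
```

Initializing the scale at zero makes the mask uniform at the start of training, so in this sketch the calibrated layer initially behaves like the unmodified patch mixing layer and learns how much to favor same-object interactions.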