It has been observed that visual classification models often rely mostly on the image background, neglecting the foreground, which hurts their robustness to distribution changes. To alleviate this shortcoming, we propose to monitor the model's relevancy signal and manipulate it so that the model focuses on the foreground object. This is done as a finetuning step involving relatively few samples, each consisting of an image paired with its foreground mask. Specifically, we encourage the model's relevancy map (i) to assign lower relevance to background regions, (ii) to consider as much information as possible from the foreground, and (iii) to yield high-confidence decisions. When applied to Vision Transformer (ViT) models, this procedure yields a marked improvement in robustness to domain shifts. Moreover, the foreground masks can be obtained automatically, from a self-supervised variant of the ViT model itself; therefore, no additional supervision is required.
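The following is a minimal PyTorch sketch of the three-term finetuning objective outlined above. The function name, the weighting coefficients, and the exact squared-error form of the background and foreground terms are illustrative assumptions rather than the paper's verified implementation; the relevance map is assumed to be a per-patch score in [0, 1] produced by the ViT (e.g., via attention-based relevancy propagation), and the foreground mask is assumed to be downsampled to the same patch grid.

```python
import torch
import torch.nn.functional as F

def relevance_guided_loss(rel, mask, logits, labels,
                          lambda_fg=0.3, lambda_clf=0.3):
    """Illustrative combination of the three objectives.

    rel, mask: (B, H, W) patch-level relevance map and foreground mask in [0, 1]
    logits:    (B, C) classifier outputs
    labels:    (B,)   ground-truth class indices
    """
    # (i) background term: push relevance outside the foreground mask toward 0
    bg_loss = (rel * (1.0 - mask)).pow(2).mean()
    # (ii) foreground term: encourage relevance to cover the foreground region
    fg_loss = ((1.0 - rel) * mask).pow(2).mean()
    # (iii) confidence term: cross-entropy keeps predictions correct and confident
    clf_loss = F.cross_entropy(logits, labels)
    return bg_loss + lambda_fg * fg_loss + lambda_clf * clf_loss
```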