We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations to be consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it yields superior visual grounding performance compared to models that instead use region-level annotations to explicitly train an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks to concentrate their attention scores within the annotated regions of interest, for images that contain such annotations. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.48% over the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and offers, by design, the added benefit of gradient-based explanations that better align with human annotations.