Learning the fine-grained interplay between vision and language enables a more accurate understanding in vision-language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignment. Most existing works are either limited by the text-agnostic and redundant regions obtained with frozen detectors, or fail to scale further due to their heavy reliance on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose the Self-Locator Aided Network (SLAN) for cross-modal understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor that localize regions of interest conditioned on different texts. By aggregating cross-modal information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments, SLAN can be easily generalized to many downstream tasks. It achieves fairly competitive results on five cross-modal understanding tasks (e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval, surpassing previous SOTA methods), and also demonstrates strong zero-shot and fine-tuned transferability to two localization tasks.
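Since the abstract only outlines the architecture, below is a minimal, hypothetical PyTorch sketch of how a text-conditioned region filter and region adaptor could be wired together. The cross-attention fusion, the top-k selection, the layer sizes, and the offset-regression head are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RegionFilter(nn.Module):
    """Scores candidate regions by relevance to the text and keeps the top-k.
    Hypothetical sketch: layer sizes and scoring head are assumptions."""
    def __init__(self, dim=256, num_heads=8, top_k=16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score_head = nn.Linear(dim, 1)
        self.top_k = top_k

    def forward(self, region_feats, text_feats):
        # region_feats: (B, R, D) candidate region features (R >= top_k)
        # text_feats:   (B, T, D) token-level text features
        fused, _ = self.cross_attn(region_feats, text_feats, text_feats)
        scores = self.score_head(fused).squeeze(-1)            # (B, R)
        top_idx = scores.topk(self.top_k, dim=-1).indices      # (B, K)
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, fused.size(-1))
        selected = torch.gather(fused, 1, gather_idx)          # (B, K, D)
        return selected, top_idx

class RegionAdaptor(nn.Module):
    """Refines coordinates of the selected regions with text-fused features.
    The small tanh-bounded offset is an assumed parameterization."""
    def __init__(self, dim=256):
        super().__init__()
        self.delta_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, selected_feats, boxes):
        # boxes: (B, K, 4) normalized (cx, cy, w, h); add a bounded offset
        return boxes + self.delta_head(selected_feats).tanh() * 0.1

# Toy usage: 100 candidate regions, 12 text tokens, feature dim 256.
B, R, T, D = 2, 100, 12, 256
regions, texts = torch.randn(B, R, D), torch.randn(B, T, D)
boxes = torch.rand(B, R, 4)
filt, adapt = RegionFilter(D, top_k=16), RegionAdaptor(D)
sel_feats, idx = filt(regions, texts)
sel_boxes = torch.gather(boxes, 1, idx.unsqueeze(-1).expand(-1, -1, 4))
refined = adapt(sel_feats, sel_boxes)   # (2, 16, 4) text-refined boxes
```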