Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. The architecture generates coarse spatial anchors from attention distributions and then refines them through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks: for example, the model achieves 86.9% accuracy at IoU@50 on the RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
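
For readers unfamiliar with the reported metric, accuracy at IoU@50 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes boxes in `(x1, y1, x2, y2)` corner format and counts a prediction as correct when its intersection-over-union with the ground-truth box is at least 0.5. The function names `iou` and `accuracy_at_iou` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    # Assumption: ">=" is used at the threshold; some evaluations use ">".
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this definition, each referring expression contributes one predicted box and one ground-truth box, and the reported percentage is the share of expressions whose predicted box overlaps the target sufficiently.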
