Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture built on multilingual SigLIP2 encoders. Our approach generates coarse spatial anchors from attention distributions and refines them through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g., 86.9% accuracy at IoU@50 in aggregate multilingual evaluation on RefCOCO, compared to 91.3% in the English-only setting. Multilingual evaluation shows consistent performance across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at $\href{https://multilingual.franreno.com}{multilingual.franreno.com}$.
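To make the anchor-plus-residual mechanism concrete, a minimal sketch follows; the notation is ours and assumed for illustration (e.g., a soft-argmax anchor), not necessarily the exact parameterization used in the model:
\[
\alpha = \operatorname{softmax}(a), \qquad (\hat{x}, \hat{y}) = \sum_{i} \alpha_i \, p_i, \qquad b = (\hat{x}, \hat{y}, \hat{w}, \hat{h}) + f_\theta(z),
\]
where $a$ denotes cross-modal attention logits over image-patch centers $p_i$, $(\hat{w}, \hat{h})$ is a coarse size estimate derived from the same attention map, and $f_\theta$ is a learned head predicting the residual offset from fused image-text features $z$.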