We propose UniTAB, which Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate the desired text and box outputs together, while also indicating the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text and box outputs with a shared token sequence, and introduces a special &lt;obj&gt; token to naturally indicate word-box alignments in the sequence. UniTAB can thus provide a more comprehensive and interpretable image description by freely grounding generated words to object regions. On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms the state of the art in both grounding and captioning evaluations. On general VL tasks with different desired output formats (i.e., text, box, or their combination), UniTAB with a single network achieves better or comparable performance to task-specific state-of-the-art models. Experiments cover 7 VL benchmarks, including grounded captioning, visual grounding, image captioning, and visual question answering. Furthermore, UniTAB's unified multi-task network and task-agnostic output sequence design make the model parameter efficient and generalizable to new tasks.
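To make the unified output format concrete, below is a minimal Python sketch of one plausible serialization of a grounded caption into a single token sequence. The &lt;obj&gt;/&lt;/obj&gt; marker pair, the &lt;coord_*&gt; token names, and the 1000-bin coordinate quantization are illustrative assumptions and may differ in detail from the paper's exact vocabulary.

```python
# Hypothetical sketch: serializing a grounded caption into one token sequence
# that mixes caption words, <obj>/</obj> markers, and discretized box
# coordinates. Token names and the 1000-bin quantization are assumptions for
# illustration, not the paper's exact design.

NUM_BINS = 1000  # assumed number of discrete coordinate tokens


def quantize(value, image_size):
    """Map a pixel coordinate to one of NUM_BINS coordinate tokens."""
    bin_id = min(int(value / image_size * NUM_BINS), NUM_BINS - 1)
    return f"<coord_{bin_id}>"


def serialize(caption_words, groundings, width, height):
    """Build a unified text-and-box output sequence.

    caption_words: list of words in the caption.
    groundings: dict mapping word index -> (x1, y1, x2, y2) box in pixels.
    """
    tokens = []
    for i, word in enumerate(caption_words):
        tokens.append(word)
        if i in groundings:  # attach the grounded box right after the word
            x1, y1, x2, y2 = groundings[i]
            tokens += ["<obj>",
                       quantize(x1, width), quantize(y1, height),
                       quantize(x2, width), quantize(y2, height),
                       "</obj>"]
    return tokens


# Example: "a dog chases a ball", with "dog" and "ball" grounded to boxes.
words = ["a", "dog", "chases", "a", "ball"]
boxes = {1: (40, 60, 200, 300), 4: (250, 280, 310, 340)}
print(" ".join(serialize(words, boxes, width=640, height=480)))
# a dog <obj> <coord_62> <coord_125> <coord_312> <coord_625> </obj> chases ...
```

Because the whole prediction is one token sequence, a single autoregressive decoder can emit text-only, box-only, or mixed outputs, which is what allows one network to cover the different task formats mentioned above.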