We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective at unifying these tasks without the need for task-specific designs, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization, or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression comprehension and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories than strong single-task baselines. All of this is accomplished by a single, unified, and efficient model. The code will be released.