In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG stands apart from the uni-modal encoders and must therefore be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++ with two-fold improvements. First, we upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.
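To make the idea of "fusion as a simple stack of Transformer encoder layers" concrete, below is a minimal sketch of a TransVG-style fusion and regression head. It assumes visual and linguistic tokens have already been projected into a shared embedding space; the class name `TransVGFusionSketch`, the dimensions, the layer counts, and the zero-initialized [REG] token are illustrative assumptions, not the authors' exact implementation (positional encodings and training details are omitted).

```python
import torch
import torch.nn as nn

class TransVGFusionSketch(nn.Module):
    """Illustrative sketch: a learnable [REG] token is prepended to the joint
    visual-linguistic token sequence, the sequence is processed by a plain
    stack of Transformer encoder layers, and the output [REG] embedding is
    regressed to four box coordinates. Hyperparameters are assumed values."""

    def __init__(self, d_model=256, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable [REG] token shared across the batch (assumed zero init).
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # MLP head regressing a normalized box (cx, cy, w, h) in [0, 1].
        self.bbox_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, d_model); text_tokens: (B, Nt, d_model),
        # both already projected into the shared embedding space.
        B = visual_tokens.size(0)
        reg = self.reg_token.expand(B, -1, -1)
        joint = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        fused = self.fusion(joint)
        # Predict the referred box directly from the fused [REG] output.
        return self.bbox_head(fused[:, 0])


# Usage: fuse 400 visual tokens with 20 text tokens and predict one box per image.
model = TransVGFusionSketch()
boxes = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])
```

The design point this sketch illustrates is that no hand-crafted cross-modal interaction module is needed: self-attention over the concatenated sequence lets every visual token attend to every text token, and the [REG] token aggregates the evidence needed for direct coordinate regression.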