Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, encoder-only grounding frameworks (e.g., TransVG) suffer from heavy computation because self-attention has quadratic time complexity. To address this issue, we present a new multimodal transformer architecture, termed Dynamic MDETR, which decouples the whole grounding process into encoding and decoding phases. The key observation is that images contain high spatial redundancy. We therefore devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features. The two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% of the feature points in the decoder, we reduce the GFLOPs of the multimodal transformer by ~44% while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and to scale up Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
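As a rough illustration of the two decoder modules described above, the following minimal sketch samples a few feature points around a reference point and then attends over them together with text features. It is not the authors' implementation: the function names, the fixed offsets (which the model would predict), the toy feature map, the bilinear interpolation, the single-head attention, and the choice to let one query attend over the concatenated sampled and text features are all assumptions made for illustration. Reference point refinement is omitted.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so that offsets pointing outside the map stay on its border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return [
        (1 - dy) * ((1 - dx) * feat[y0][x0][c] + dx * feat[y0][x1][c])
        + dy * ((1 - dx) * feat[y1][x0][c] + dx * feat[y1][x1][c])
        for c in range(len(feat[0][0]))
    ]


def sample_points(feat, ref, offsets):
    """Gather features at ref + offset for each (hypothetical) predicted offset."""
    rx, ry = ref
    return [bilinear_sample(feat, rx + ox, ry + oy) for ox, oy in offsets]


def cross_attention(query, keys):
    """Single-head scaled dot-product attention; keys also serve as values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * key[c] for w, key in zip(weights, keys)) for c in range(d)]


# Toy 4 x 4 feature map with 2 channels; each cell stores its own (x, y).
feat = [[[float(x), float(y)] for x in range(4)] for y in range(4)]

ref = (1.5, 1.5)                                   # assumed reference point
offsets = [(-0.5, 0.0), (0.5, 0.75), (1.0, -1.0)]  # assumed offsets, not learned here
sampled = sample_points(feat, ref, offsets)        # 3 of 16 locations are used

text = [[0.5, 0.5], [1.0, 0.0]]                    # assumed text token features
query = [1.0, 1.0]                                 # assumed decoder query
out = cross_attention(query, sampled + text)
print(sampled, out)
```

The attention cost here grows with the number of sampled points rather than with the full feature map size, which is the intuition behind decoding from a small subset of feature points.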