Recently, by introducing large-scale datasets and strong transformer networks, video-language pre-training has shown great success, especially for retrieval. Yet, existing video-language transformer models do not explicitly model fine-grained semantic alignment. In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformers to incorporate object representations. The key idea is to leverage bounding boxes and object tags to guide the training process. We evaluate our model on three standard sub-tasks of video-text matching across four widely used benchmarks. We also provide in-depth analysis and detailed ablation studies of the proposed method. We show clear improvements in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture. The code will be released at \url{https://github.com/FingerRec/OA-Transformer}.