With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most existing approaches learn a joint embedding space for plain textual and visual content without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles text and video into the semantic roles of objects, spatial contexts, and temporal contexts, together with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a clear margin on all metrics, and also outperforms two other SOTA methods on two metrics.
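To make the role-disentangled attention idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes three role streams (objects, spatial context, temporal context) already embedded to a common dimension, applies per-role self-attention for intra-role correlations, and cross-attention to the other two roles for inter-role correlations. All class, parameter, and dimension choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RoleDisentangledBlock(nn.Module):
    """Toy block: intra-role self-attention + inter-role cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # One self-attention module per role (intra-role correlations).
        self.intra = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        # One cross-attention module per role, attending to the other two
        # roles concatenated along the sequence axis (inter-role correlations).
        self.inter = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)]
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, roles):
        # roles: list of 3 tensors [batch, seq_len_i, dim]
        # in the order (objects, spatial context, temporal context).
        out = []
        for i, x in enumerate(roles):
            # Intra-role: attend within the role's own tokens.
            x = x + self.intra[i](x, x, x)[0]
            # Inter-role: attend to the tokens of the other two roles.
            others = torch.cat([roles[j] for j in range(3) if j != i], dim=1)
            x = x + self.inter[i](x, others, others)[0]
            out.append(self.norm(x))
        return out


if __name__ == "__main__":
    # Toy usage: batch of 2, arbitrary sequence lengths per role.
    block = RoleDisentangledBlock()
    roles = [torch.randn(2, n, 256) for n in (5, 7, 3)]
    obj, spa, tem = block(roles)
    print(obj.shape, spa.shape, tem.shape)
```

In such a design, the disentangled outputs could then be pooled per role and matched against the corresponding textual roles, enabling matching at the object, spatial, and temporal levels rather than a single global embedding.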