Vast quantities of video are uploaded daily as social channels grow in popularity; thus, retrieving the video content most relevant to a user's textual query plays an increasingly crucial role. Most methods consider only a single joint embedding space between global visual and textual features, without considering the local structure of each modality. Some other approaches use multiple embedding spaces for global and local features separately, but ignore rich inter-modality correlations. We propose RoME, a novel mixture-of-experts transformer that disentangles the text and the video into three levels: the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and textual embeddings at both global and local levels, with mixture-of-experts to capture the correlations across modalities and structures. The results indicate that our method outperforms state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same visual backbone and no pre-training. Finally, we conduct extensive ablation studies to elucidate our design choices.
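For concreteness, the sketch below (not the paper's implementation; the module names, feature dimensions, and text-conditioned gating are illustrative assumptions) shows one common way a mixture-of-experts fusion over spatial, temporal, and object embeddings can weight per-expert text-video similarities into a single retrieval score.

```python
# Minimal sketch of mixture-of-experts fusion for text-video retrieval,
# assuming three video experts (spatial, temporal, object) and a single
# text feature; all dimensions and names are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFusion(nn.Module):
    def __init__(self, text_dim=768, video_dims=(2048, 1024, 512), embed_dim=512):
        super().__init__()
        self.num_experts = len(video_dims)
        # Project each video expert and the text into a shared embedding space.
        self.video_proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in video_dims)
        self.text_proj = nn.ModuleList(nn.Linear(text_dim, embed_dim) for _ in video_dims)
        # Text-conditioned gating produces the mixture weights over experts.
        self.gate = nn.Linear(text_dim, self.num_experts)

    def forward(self, text_feat, video_feats):
        # text_feat: (B, text_dim); video_feats: list of (B, video_dims[i])
        weights = F.softmax(self.gate(text_feat), dim=-1)            # (B, E)
        sims = []
        for i in range(self.num_experts):
            t = F.normalize(self.text_proj[i](text_feat), dim=-1)    # (B, D)
            v = F.normalize(self.video_proj[i](video_feats[i]), dim=-1)
            sims.append(t @ v.t())                                   # (B, B) per-expert cosine similarity
        sims = torch.stack(sims, dim=-1)                             # (B, B, E)
        # Weight each expert's similarity by the query's gate and sum over experts.
        return (sims * weights.unsqueeze(1)).sum(-1)                 # (B, B) fused similarity matrix

# Usage: sim = MoEFusion()(text_emb, [spatial_emb, temporal_emb, object_emb])
```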