In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained following a multiple-space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine textual and visual features, and document the performance of the proposed network. Source code is made publicly available at: https://github.com/bmezaris/TextToVideoRetrieval-TtimesV
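As a rough illustration of the retrieval stage described above, the following minimal sketch fuses the query-video similarity matrices produced by several joint feature spaces and then revises the fused similarities with additional softmax operations. It is not the authors' exact implementation: the dual-softmax-style revision, the fusion by summation, and the temperature value are assumptions made for illustration only.

```python
# Minimal sketch (assumptions noted above), not the paper's exact method.
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def revised_similarities(per_space_sims, temperature=100.0):
    """per_space_sims: list of (num_queries, num_videos) similarity matrices,
    one per joint feature space (i.e., per text-video feature pair)."""
    # Fuse the joint spaces by summing their similarity matrices (assumption).
    fused = np.sum(per_space_sims, axis=0)
    # Revise each query-video similarity with a softmax over the video axis,
    # weighted by a softmax over the query axis (dual-softmax-style revision;
    # the exact scheme in the paper may differ).
    over_queries = softmax(temperature * fused, axis=0)
    over_videos = softmax(temperature * fused, axis=1)
    return over_videos * over_queries

# Toy usage: 3 joint spaces, 4 text queries, 5 videos.
rng = np.random.default_rng(0)
sims = [rng.uniform(-1.0, 1.0, size=(4, 5)) for _ in range(3)]
ranking = np.argsort(-revised_similarities(sims), axis=1)
print(ranking[0])  # video indices ranked for the first query
```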