In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. Moreover, the interpretability of LAFF can be exploited for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
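The core idea of the abstract, fusing a set of heterogeneous feature vectors through a learned convex combination rather than multi-head self-attention, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `laff_fuse`, the shared projection matrices, and the single attention weight vector are all assumptions made for the sketch.

```python
import numpy as np

def laff_fuse(features, proj_mats, w_att):
    """Fuse k heterogeneous feature vectors into one d-dim vector by a
    convex combination whose weights come from a lightweight attention.

    features  : list of k 1-D arrays (possibly of different dimensions)
    proj_mats : list of k (d_i x d) matrices projecting each feature
                into a common d-dim embedding space (assumed learned)
    w_att     : (d,) weight vector of a shared linear attention layer
    """
    # Project every feature into the common embedding space: shape (k, d).
    h = np.stack([f @ P for f, P in zip(features, proj_mats)])
    # One scalar score per feature; softmax turns scores into convex weights.
    scores = h @ w_att
    a = np.exp(scores - scores.max())
    a /= a.sum()
    # Convex combination: weights are non-negative and sum to one,
    # so the fused vector stays in the convex hull of the projections.
    return a @ h
```

The convex-combination weights are directly inspectable per feature, which is the interpretability the abstract refers to: a feature that consistently receives a near-zero weight is a candidate for removal.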