In this paper, we revisit \emph{feature fusion}, an old-fashioned topic, in the new context of video retrieval by text. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations by computationally heavy multi-head self-attention. Accordingly, we propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. Extensive experiments on four public datasets, i.e., MSR-VTT, MSVD, TGIF and VATEX, and on the large-scale TRECVID AVS benchmark evaluations (2016-2020) show the viability of LAFF. Moreover, LAFF is extremely simple to implement, making it appealing for real-world deployment.
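To make the idea of an attention-driven convex combination concrete, the snippet below is a minimal PyTorch-style sketch of what such a lightweight fusion block might look like. The class name, the shared 512-dimensional projection space, the tanh nonlinearity, and the single scoring vector are illustrative assumptions, not the paper's exact implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class LightweightAttentionalFusion(nn.Module):
    """Sketch: attention-based convex combination of multiple features.

    Each input feature is projected to a shared dimension; a single linear
    layer scores every projected feature, and a softmax turns the scores
    into convex-combination weights (non-negative, summing to one).
    """

    def __init__(self, input_dims, common_dim=512):
        super().__init__()
        # One projection per (possibly heterogeneous) input feature.
        self.projections = nn.ModuleList(
            nn.Linear(d, common_dim) for d in input_dims
        )
        # A single scoring vector: far lighter than multi-head self-attention.
        self.score = nn.Linear(common_dim, 1, bias=False)

    def forward(self, features):
        # features: list of tensors, each of shape (batch, input_dims[i])
        projected = torch.stack(
            [torch.tanh(p(f)) for p, f in zip(self.projections, features)],
            dim=1,
        )  # (batch, num_features, common_dim)
        weights = torch.softmax(self.score(projected).squeeze(-1), dim=1)
        # Convex combination of the projected features.
        return (weights.unsqueeze(-1) * projected).sum(dim=1)

# Usage: fuse a hypothetical 2048-d CNN feature with a 512-d CLIP feature.
fusion = LightweightAttentionalFusion([2048, 512])
fused = fusion([torch.randn(4, 2048), torch.randn(4, 512)])  # (4, 512)
\end{verbatim}

Because the weights come from a softmax, the fused vector always lies in the convex hull of the projected features, which is the property the hypothesis above contrasts with correlation modeling via multi-head self-attention.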