We propose a novel method capable of retrieving clips from untrimmed videos based on natural language queries. This cross-modal retrieval task plays a key role in visual-semantic understanding, and requires localizing clips in time and computing their similarity to the query sentence. Current methods generate sentence and video embeddings and then compare them using a late fusion approach, but this ignores the word order in queries and prevents more fine-grained comparisons. Motivated by the need for fine-grained multi-modal feature fusion, we propose a novel early fusion embedding approach that combines video and language information at the word level. Furthermore, we use the inverse task of dense video captioning as a side-task to improve the learned embedding. Our full model combines these components with an efficient proposal pipeline that performs accurate localization of potential video clips. We present a comprehensive experimental validation on two large-scale text-to-clip datasets (Charades-STA and DiDeMo) and attain state-of-the-art retrieval results with our model.
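To make the early-fusion idea concrete, the sketch below shows one way word-level fusion can be wired up: each query word embedding is concatenated with the candidate clip's feature before the recurrent encoder, so visual information interacts with every word instead of only with a pooled sentence vector as in late fusion. This is an illustrative PyTorch sketch, not the paper's implementation; the module name `EarlyFusionScorer`, the feature dimensions, and the single-LSTM scoring head are assumptions made for the example.

```python
import torch
import torch.nn as nn

class EarlyFusionScorer(nn.Module):
    """Minimal sketch of word-level early fusion (illustrative, not the authors' code)."""
    def __init__(self, word_dim=300, clip_dim=500, hidden_dim=512):
        super().__init__()
        # LSTM consumes word and clip features jointly, i.e. fusion happens before encoding.
        self.lstm = nn.LSTM(word_dim + clip_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)  # scalar query-clip similarity

    def forward(self, words, clip_feat):
        # words:     (batch, num_words, word_dim)  query word embeddings
        # clip_feat: (batch, clip_dim)             pooled feature of a candidate clip
        fused = torch.cat(
            [words, clip_feat.unsqueeze(1).expand(-1, words.size(1), -1)], dim=-1
        )                                   # attach the clip feature to every word
        _, (h, _) = self.lstm(fused)        # encode the fused word sequence
        return self.score(h[-1]).squeeze(-1)

# Usage: score 4 candidate clips against one 8-word query (query repeated per candidate).
model = EarlyFusionScorer()
words = torch.randn(4, 8, 300)
clips = torch.randn(4, 500)
print(model(words, clips).shape)  # torch.Size([4])
```

Because fusion precedes the sequence encoder, the model can weight visual evidence differently for each word position, which is the fine-grained comparison that a single late-fusion dot product between sentence and video embeddings cannot express.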