Cross-modal retrieval between videos and texts has attracted growing research interest due to the rapid emergence of videos on the web. In general, a video contains rich instance and event information, while a query text describes only part of that information. A single video can therefore correspond to multiple different text descriptions and queries; we call this phenomenon the ``Video-Text Correspondence Ambiguity'' problem. Current techniques mostly concentrate on mining local or multi-level alignments between the contents of a video and a text (\textit{e.g.}, object to entity and action to verb). These methods struggle to alleviate the video-text correspondence ambiguity because they describe a video with one single feature, which must then be matched against multiple different text features at the same time. To address this problem, we propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video by adaptively aggregating video token features. Given a query text, the overall similarity is determined by the prototype most similar to that text, which locates the corresponding content in the video; we term this text-adaptive matching. To learn diverse prototypes that represent the rich information in videos, we propose a variance loss that encourages different prototypes to attend to different contents of the video. Our method outperforms state-of-the-art methods on four public video retrieval datasets.
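The following is a minimal PyTorch sketch of the three ideas stated above: prototype generation by adaptive aggregation of video tokens, text-adaptive (max-over-prototypes) matching, and a variance-style diversity loss. The module and function names, the attention-based aggregation, and the hinge margin are illustrative assumptions made for this sketch, not the authors' released implementation.

```python
# Hedged sketch: prototype aggregation, text-adaptive matching, variance loss.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeGenerator(nn.Module):
    """Aggregates video token features into K visual prototypes via learned attention."""

    def __init__(self, dim: int, num_prototypes: int):
        super().__init__()
        # One learnable query per prototype; each query attends over the video tokens.
        self.queries = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, video_tokens: torch.Tensor):
        # video_tokens: (B, N, D) frame/patch token features.
        attn = torch.einsum("kd,bnd->bkn", self.queries, video_tokens) * self.scale
        attn = attn.softmax(dim=-1)                                   # (B, K, N) aggregation weights
        prototypes = torch.einsum("bkn,bnd->bkd", attn, video_tokens)  # (B, K, D)
        return F.normalize(prototypes, dim=-1), attn


def text_adaptive_similarity(prototypes: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Video-text similarity = similarity of the prototype most similar to the query text."""
    text_emb = F.normalize(text_emb, dim=-1)                # (B_t, D)
    sims = torch.einsum("vkd,td->vtk", prototypes, text_emb)  # (B_v, B_t, K) cosine similarities
    return sims.max(dim=-1).values                          # max over prototypes -> (B_v, B_t)


def variance_loss(prototypes: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Penalizes collapsed (near-identical) prototypes of the same video.

    A hinge on the per-dimension standard deviation across the K prototypes:
    identical prototypes give std ~ 0 and incur loss; diverse ones incur none.
    The margin is a hyperparameter assumed here for illustration.
    """
    std = prototypes.std(dim=1)                             # (B, D)
    return F.relu(margin - std).mean()
```

In a typical training setup, the text-adaptive similarity matrix would feed a standard contrastive (InfoNCE-style) retrieval objective, with the variance term added as a weighted regularizer; the exact objective and weighting here are assumptions rather than the paper's specification.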