This paper strives to find, amidst a set of sentences, the one that best describes the content of a given image or video. Different from existing works, which rely on a joint subspace for image and video caption retrieval, we propose to do so exclusively in a visual space. Apart from this conceptual novelty, we contribute \emph{Word2VisualVec}, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval by predicting from text both 3-D convolutional neural network features and a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
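As a concrete illustration of the text-to-visual-space mapping sketched above, the snippet below shows a minimal Word2VisualVec-style model, assuming a PyTorch setup. The sentence vectorization, dimensionalities, mean-squared-error objective and cosine-similarity ranking are illustrative assumptions, not the paper's exact configuration.

\begin{verbatim}
# Minimal sketch of predicting a visual feature from a sentence vector.
# Assumes PyTorch; dims and loss are placeholders, not the paper's settings.
import torch
import torch.nn as nn

class Word2VisualVec(nn.Module):
    def __init__(self, text_dim=2048, hidden_dim=2048, visual_dim=2048):
        super().__init__()
        # Multi-layer perceptron mapping a sentence vector into visual space.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, sentence_vec):
        return self.mlp(sentence_vec)

model = Word2VisualVec()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Stand-ins for vectorized captions and the CNN features of the images
# (or video frames) those captions describe.
sentence_vecs = torch.randn(32, 2048)
visual_feats = torch.randn(32, 2048)

optimizer.zero_grad()
pred = model(sentence_vecs)
loss = loss_fn(pred, visual_feats)   # regress onto ground-truth visual features
loss.backward()
optimizer.step()

# At retrieval time, candidate captions are projected into the visual space
# and ranked, e.g. by cosine similarity to the query image's feature vector.
query_feat = torch.randn(2048)
scores = torch.cosine_similarity(model(sentence_vecs), query_feat.unsqueeze(0))
best_caption_idx = scores.argmax().item()
\end{verbatim}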