Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (\eg, max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modalities and feature extractors is costly and tedious, especially when the size of the features varies (\eg, text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model with the proposed GPO and denote it as VSE$\infty$. Without bells and whistles, VSE$\infty$ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$\infty$ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models.
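To make the generalized-pooling idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: it sorts each feature dimension and takes a position-weighted sum, so that max pooling and mean pooling fall out as special cases of the weight vector. The function name and the fixed weight vectors are illustrative assumptions; in GPO the per-position coefficients are learned (and generated for variable-length inputs), not hand-set as here.

```python
import numpy as np

def generalized_pooling(features, weights):
    """Position-weighted sum over per-dimension sorted features.

    features: (N, D) array of N feature vectors of dimension D.
    weights:  (N,) coefficients over sorted positions (illustrative;
              GPO learns these rather than fixing them by hand).
    Returns a single (D,) pooled embedding.
    """
    # Sort each dimension's values in descending order along the set axis.
    sorted_feats = -np.sort(-features, axis=0)
    # Weighted sum over sorted positions for every dimension.
    return weights @ sorted_feats

rng = np.random.default_rng(0)
N, D = 5, 4
x = rng.random((N, D))

# Max pooling: all weight on the top-ranked value per dimension.
w_max = np.zeros(N)
w_max[0] = 1.0
assert np.allclose(generalized_pooling(x, w_max), x.max(axis=0))

# Mean (average) pooling: uniform weights over all positions.
w_mean = np.full(N, 1.0 / N)
assert np.allclose(generalized_pooling(x, w_mean), x.mean(axis=0))
```

Because both classic strategies are points in this weight space, learning the weights lets the model discover whichever pooling (or mixture) suits a given modality and feature extractor, instead of tuning the choice by hand.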