Learning the Best Pooling Strategy for Visual Semantic Embedding

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval. It aims to learn a deep embedding space in which visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models across different feature extractors. Although such pooling is simple and effective, finding the best pooling function for each data modality and feature extractor is costly and tedious, especially when the size of the features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model with the proposed GPO and denote it as VSE$\infty$. Without bells and whistles, VSE$\infty$ significantly outperforms previous VSE methods on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$\infty$ further demonstrate its strength by achieving a new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can serve as a plug-and-play feature aggregation module for standard VSE models. Code and pre-trained models are available at https://vse-infty.github.io.
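
To make the idea of a pooling operator that adapts to the best strategy more concrete, the sketch below shows one way to view several common pooling functions as a single operation: sort each feature dimension, then take a weighted sum over the sorted positions. Max, mean, and k-max pooling then differ only in their position weights, so a model could in principle learn those weights instead of having them chosen by hand. This is an illustrative sketch and not the authors' implementation; all function names here (`sorted_weighted_pool`, `max_weights`, `mean_weights`, `k_max_weights`, `softmax_weights`) are hypothetical, and the softmax parametrization is an assumption used only to show how the weights could be made learnable.

```python
import numpy as np


def sorted_weighted_pool(features, weights):
    """Pool an (n, d) feature set into a (d,) vector.

    Each feature dimension is sorted in descending order, and the sorted
    values are combined with one weight per sorted position.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sorted_desc = -np.sort(-features, axis=0)  # descending sort per column
    return weights @ sorted_desc               # (n,) @ (n, d) -> (d,)


def max_weights(n):
    """All weight on the largest value: reproduces max pooling."""
    w = np.zeros(n)
    w[0] = 1.0
    return w


def mean_weights(n):
    """Uniform weights: reproduces average pooling."""
    return np.full(n, 1.0 / n)


def k_max_weights(n, k):
    """Uniform weights on the top-k values: reproduces k-max pooling."""
    w = np.zeros(n)
    w[:k] = 1.0 / k
    return w


def softmax_weights(logits):
    """Hypothetical learnable parametrization: position logits -> weights."""
    logits = np.asarray(logits, dtype=float)
    e = np.exp(logits - logits.max())
    return e / e.sum()


# Three local features (e.g., regions or tokens), each with two dimensions.
feats = np.array([[1.0, 4.0],
                  [3.0, 2.0],
                  [2.0, 6.0]])

max_pooled = sorted_weighted_pool(feats, max_weights(3))         # [3.0, 6.0]
mean_pooled = sorted_weighted_pool(feats, mean_weights(3))       # [2.0, 4.0]
top2_pooled = sorted_weighted_pool(feats, k_max_weights(3, 2))   # [2.5, 5.0]
learned_like = sorted_weighted_pool(feats, softmax_weights([0.0, 0.0, 0.0]))
```

With all-zero logits the softmax weights are uniform, so `learned_like` matches mean pooling; pushing the first logit far above the others would move the result toward max pooling. Handling inputs whose number of features varies (as with text or video) would require generating the weights as a function of the sequence length, which this fixed-length sketch does not attempt.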
