Contrastive pre-training on weakly related image-text pairs has proven powerful for learning semantically aligned cross-modal models. The usual measure of distance between the feature representations of an image-text pair is based on cosine similarity, which is mathematically equivalent to the negative inner product of features embedded on a sphere. While this topology benefits from low computational cost and a properly defined notion of uniformity, it has two major drawbacks in practice. First, it is vulnerable to semantic ambiguity arising from noise in the weakly related image-text pairs. Second, learning is unstable and fragile at the beginning of training. Former studies employ a learnable softmax temperature parameter and a long warmup scheme to stabilize training, but an in-depth analysis of these problems is still lacking. In this work, we discuss, from the viewpoint of optimization, the desired properties of the topology and its endowed distance function for the embedding vectors of feature representations. We then propose a rather simple solution to mitigate the aforementioned problems: we map the feature representations onto the oblique manifold endowed with the negative inner product as the distance function. In the experimental analysis, we show that the baseline performance can be improved by a large margin (e.g., 4% in the zero-shot image-to-text retrieval task) by changing only two lines of the training code.
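To make the proposed mapping concrete, the following is a minimal sketch and not the authors' implementation. It assumes the oblique manifold is realized in the standard way, as a product of unit spheres: each feature vector is split into several sub-vectors, and each sub-vector is L2-normalized. The function names (`sphere_embed`, `oblique_embed`, `neg_inner_product`), the feature dimension of 512, and the choice of 8 blocks are hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import numpy as np


def sphere_embed(x, eps=1e-12):
    """Baseline: L2-normalize each feature vector onto the unit sphere."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def oblique_embed(x, n_blocks, eps=1e-12):
    """Assumed oblique-manifold mapping: split each feature vector into
    n_blocks sub-vectors, L2-normalize every sub-vector, and flatten back."""
    batch, dim = x.shape
    if dim % n_blocks != 0:
        raise ValueError("feature dimension must be divisible by n_blocks")
    blocks = x.reshape(batch, n_blocks, dim // n_blocks)
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    blocks = blocks / np.maximum(norms, eps)
    return blocks.reshape(batch, dim)


def neg_inner_product(a, b):
    """Pairwise distance matrix: negative inner product between rows of a and b."""
    return -(a @ b.T)


# Illustrative random features standing in for image/text encoder outputs.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 512))
text_feats = rng.normal(size=(4, 512))

img = oblique_embed(image_feats, n_blocks=8)
txt = oblique_embed(text_feats, n_blocks=8)
dist = neg_inner_product(img, txt)  # shape (4, 4)
```

Under these assumptions, the only difference from the sphere baseline is the reshape before and after normalization, which is consistent with the claim that few lines of training code need to change. With `n_blocks=1` the mapping reduces to the usual sphere embedding, and with `n_blocks=k` every pairwise distance lies in the interval from `-k` to `k` rather than from `-1` to `1`.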