Image-text retrieval is the task of searching for the proper textual description of a visual scene and vice versa. One challenge of this task is vulnerability to corruptions of the input image and text. Such corruptions are often unobserved during training and substantially degrade the retrieval model's decision quality. In this paper, we propose a novel image-text retrieval technique, referred to as robust visual semantic embedding (RVSE), which consists of novel image-based and text-based augmentation techniques called semantic-preserving augmentation for image (SPAugI) and text (SPAugT). Since SPAugI and SPAugT change the original data in a way that preserves its semantic information, we enforce the feature extractors to generate semantic-aware embedding vectors regardless of the corruption, significantly improving model robustness. Through extensive experiments on benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.