This paper investigates the open research problem of generating text-image pairs to improve the training of the fine-grained image-to-text cross-modal retrieval task, and proposes a novel framework for paired data augmentation that uncovers the hidden semantic information of the StyleGAN2 model. Specifically, we first train a StyleGAN2 model on the given dataset. We then project the real images back into the latent space of StyleGAN2 to obtain their latent codes. To make the generated images manipulable, we further introduce a latent space alignment module that learns the alignment between StyleGAN2 latent codes and the corresponding textual caption features. During online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to output latent codes, which are finally fed to StyleGAN2 to generate the augmented images. We evaluate the efficacy of our augmentation approach on two public cross-modal retrieval datasets; the promising experimental results demonstrate that the augmented text-image pairs can be trained together with the original data to boost image-to-text cross-modal retrieval performance.
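To make the online augmentation step concrete, the sketch below illustrates how a caption could be perturbed by random token replacement, mapped to StyleGAN2 latent codes by an alignment module, and decoded into an augmented image. All names, dimensions, and interfaces here (LatentAligner, text_encoder, generator) are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of the online paired data augmentation pipeline described above.
# The text encoder and StyleGAN2 generator are assumed to be provided callables;
# latent dimensions (14 x 512, W+ space) are hypothetical placeholders.
import random
import torch
import torch.nn as nn

class LatentAligner(nn.Module):
    """Maps a caption feature vector to assumed StyleGAN2 W+ latent codes."""
    def __init__(self, text_dim=512, num_ws=14, w_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_ws * w_dim),
        )
        self.num_ws, self.w_dim = num_ws, w_dim

    def forward(self, text_feat):
        w = self.mlp(text_feat)
        return w.view(-1, self.num_ws, self.w_dim)

def augment_caption(tokens, vocab, replace_prob=0.15):
    """Random token replacement: each token is swapped with a random vocab word
    with probability replace_prob."""
    return [random.choice(vocab) if random.random() < replace_prob else t
            for t in tokens]

def generate_pair(tokens, vocab, text_encoder, aligner, generator):
    """Produce one augmented (image, caption) pair for online training.
    text_encoder: tokens -> (1, text_dim) tensor   [assumed interface]
    generator:    (1, num_ws, w_dim) -> (1, 3, H, W) image [assumed interface]"""
    aug_tokens = augment_caption(tokens, vocab)
    text_feat = text_encoder(aug_tokens)
    latent = aligner(text_feat)
    aug_image = generator(latent)
    return aug_image, aug_tokens
```

In this reading, the augmented pairs are generated on the fly during training and mixed with the original batches, rather than being precomputed offline.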