Existing techniques for image-to-image translation commonly suffer from two critical problems: heavy reliance on per-sample domain annotations and/or an inability to handle multiple attributes per image. Recent truly unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting in which a single sample may have multiple attributes. In addition, the semantics of the clusters are not easily aligned with human understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes, given as texts for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels, so that users can specify the target domain with a set of attributes in language. To account for the case in which the initial prompts are inaccurate, we also present prompt learning. We further present a domain regularization loss that enforces translated images to be mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves performance comparable or superior to existing models.
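To illustrate the core idea of deriving per-sample multi-hot domain labels from image-attribute similarity, here is a minimal sketch. It assumes a CLIP-style vision-language encoder (the abstract does not name the encoder), and the attribute list, prompt template, and similarity threshold are all placeholders, not values from the paper.

```python
# Minimal sketch: multi-hot pseudo-labeling from image-text similarity,
# assuming OpenAI's CLIP as the vision-language encoder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate attributes given in text for the dataset (hypothetical examples).
attributes = ["smiling", "wearing glasses", "blond hair", "young"]
prompts = clip.tokenize([f"a photo of a {a} person" for a in attributes]).to(device)

image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    # Normalize so the dot product is a cosine similarity per attribute.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = (img_feat @ txt_feat.T).squeeze(0)

# Multi-hot label: every attribute whose similarity clears a threshold
# (the threshold value here is an assumption, not taken from the paper).
threshold = 0.25
multi_hot = (sim > threshold).float()
print(dict(zip(attributes, multi_hot.tolist())))
```

Because the label is a thresholded vector rather than an argmax, a single image can belong to several domains at once, which is what lets users specify a target domain as a set of attributes in language.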