Existing techniques for image-to-image translation commonly suffer from two critical problems: a heavy reliance on per-sample domain annotations and/or an inability to handle multiple attributes per image. Recent methods adopt clustering approaches to provide per-sample annotations in an unsupervised manner. However, they cannot account for the real-world setting in which one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate domain annotations given in texts for a dataset and jointly optimize them during training. The target style is specified by aggregating the style vectors of multiple domains according to a multi-hot domain assignment. Because the initial candidate domain texts might be inaccurate, we set them to be learnable and jointly fine-tune them during training. Furthermore, we introduce a slack domain to cover samples that are not covered by the candidate domains. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models.
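To make the multi-hot aggregation step concrete, below is a minimal sketch of how a target style could be formed from per-domain style vectors. The tensor names, shapes, and the uniform averaging over active domains are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Hypothetical sizes: K candidate domains, d-dimensional style codes.
K, d = 10, 64

def aggregate_style(style_vectors: torch.Tensor,
                    domain_assignment: torch.Tensor) -> torch.Tensor:
    """Average the style vectors of the domains assigned to an image.

    style_vectors:     (K, d) one style code per candidate domain
    domain_assignment: (K,)   multi-hot vector, 1 for each active domain
    """
    # Normalize so the result is a mean over active domains;
    # clamp guards against an empty assignment.
    weights = domain_assignment / domain_assignment.sum().clamp(min=1.0)
    return weights @ style_vectors  # (d,) aggregated target style

# Example: an image assigned to domains 2 and 5 simultaneously.
styles = torch.randn(K, d)
assignment = torch.zeros(K)
assignment[[2, 5]] = 1.0
target_style = aggregate_style(styles, assignment)
```

A soft variant would replace the binary assignment with per-domain probabilities, in which case the same weighted sum applies without thresholding.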