In most scenarios, conditional image generation can be thought of as an inversion of the image understanding process. Since generic image understanding involves solving multiple tasks, it is natural to aim at generating images via multi-conditioning. However, multi-conditional image generation is a very challenging problem due to the heterogeneity and sparsity of the conditioning labels available in practice. In this work, we propose a novel neural architecture to address the heterogeneity and sparsity of spatially multi-conditional labels. Our choice of spatial conditioning, such as by semantics and depth, is driven by the promise it holds for better control of the image generation process. The proposed method uses a transformer-like architecture operating pixel-wise, which receives the available labels as input tokens and merges them into a learned homogeneous label space. The merged labels are then used for image generation via conditional generative adversarial training. In this process, the sparsity of the labels is handled by simply dropping the input tokens corresponding to labels that are missing at a given location, which the proposed pixel-wise architecture makes straightforward. Our experiments on three benchmark datasets demonstrate the clear superiority of our method over the state of the art and the compared baselines. The source code will be made publicly available.
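To make the pixel-wise merging idea concrete, the following is a minimal sketch (not the authors' released code) of how per-pixel label tokens could be fused with attention while missing labels are simply dropped via a mask. The class name LabelMerger, the embedding dimension, and the use of a single learned merge query are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class LabelMerger(nn.Module):
    """Fuse a variable set of per-pixel label tokens into one embedding (illustrative sketch)."""

    def __init__(self, num_label_types: int, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Learned type embeddings tell the attention which label type each token carries.
        self.type_embed = nn.Parameter(torch.randn(num_label_types, embed_dim))
        # A single learned query gathers the available label tokens into one vector per pixel.
        self.query = nn.Parameter(torch.randn(1, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, available: torch.Tensor) -> torch.Tensor:
        """
        tokens:    (P, L, D) per-pixel label tokens, P pixels, L label types, D channels.
        available: (P, L) boolean mask, True where a label exists at that pixel.
        returns:   (P, D) merged label embedding per pixel.
        """
        tokens = tokens + self.type_embed                  # add label-type information
        q = self.query.expand(tokens.size(0), 1, -1)       # one merge query per pixel
        # key_padding_mask is True at positions to ignore, i.e. the missing labels.
        merged, _ = self.attn(q, tokens, tokens, key_padding_mask=~available)
        return merged.squeeze(1)


# Toy usage: an 8x8 image with two label types (e.g. semantics and depth embeddings),
# where the second label is missing at every other pixel.
pixels, label_types, dim = 64, 2, 64
tokens = torch.randn(pixels, label_types, dim)
available = torch.ones(pixels, label_types, dtype=torch.bool)
available[::2, 1] = False

merger = LabelMerger(label_types, dim)
merged = merger(tokens, available)
print(merged.shape)  # torch.Size([64, 64]) -> one fused label vector per pixel
```

In this sketch the merged per-pixel vectors would then serve as the conditioning input to a generator trained adversarially; that stage is omitted here.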