Recently, there has been substantial progress in image synthesis from semantic labelmaps. However, methods for this task assume the availability of complete and unambiguous labelmaps, with object instance boundaries and class labels for each pixel. This reliance on heavily annotated inputs restricts the use of image synthesis techniques in real-world applications, especially under uncertainty due to weather, occlusion, or noise. On the other hand, algorithms that can synthesize images from sparse labelmaps or sketches are highly desirable as tools that guide content creators and artists to quickly generate scenes by simply specifying the locations of a few objects. In this paper, we address the problem of complex scene completion from sparse labelmaps. Under this setting, very few details about the scene (30\% of object instances) are available as input for image synthesis. We propose a two-stage deep-network-based method, called `Halluci-Net', that learns co-occurrence relationships between objects in scenes and then exploits these relationships to produce a dense and complete labelmap. The generated dense labelmap can then be used as input by state-of-the-art image synthesis techniques such as pix2pixHD to obtain the final image. The proposed method is evaluated on the Cityscapes dataset and outperforms two baseline methods on performance metrics such as Fr\'echet Inception Distance (FID), semantic segmentation accuracy, and similarity in object co-occurrences. We also show qualitative results on a subset of the ADE20K dataset that contains bedroom images.
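As a rough illustration only (the abstract does not specify implementation details), the following minimal PyTorch sketch shows how a two-stage labelmap-completion pipeline of this kind could be composed: a first stage hallucinates a coarse dense labelmap from the sparse input, and a second stage refines it conditioned on both the sparse input and the coarse prediction, before handing the result to an image synthesis network such as pix2pixHD. The \texttt{StageGenerator} architecture, channel sizes, and the \texttt{complete\_labelmap} helper are hypothetical placeholders, not the authors' implementation.

\begin{verbatim}
import torch
import torch.nn as nn

class StageGenerator(nn.Module):
    """Placeholder stage network: maps a one-hot labelmap (plus any
    conditioning channels) to per-pixel class logits."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes, 1),
        )

    def forward(self, x):
        return self.net(x)

def complete_labelmap(sparse_onehot, stage1, stage2):
    """Two-stage completion (sketch): stage 1 predicts a coarse dense
    labelmap from the sparse input; stage 2 refines it, conditioned on
    the sparse input concatenated with the stage-1 prediction."""
    coarse = torch.softmax(stage1(sparse_onehot), dim=1)
    refined_logits = stage2(torch.cat([sparse_onehot, coarse], dim=1))
    return refined_logits.argmax(dim=1)  # dense labelmap, shape (B, H, W)

if __name__ == "__main__":
    num_classes = 20  # number of semantic classes (placeholder value)
    stage1 = StageGenerator(num_classes, num_classes)
    stage2 = StageGenerator(2 * num_classes, num_classes)
    # Mostly-empty one-hot input standing in for a sparse labelmap.
    sparse = torch.zeros(1, num_classes, 128, 256)
    dense = complete_labelmap(sparse, stage1, stage2)
    print(dense.shape)  # torch.Size([1, 128, 256]); pix2pixHD would
                        # then synthesize the image from this labelmap
\end{verbatim}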