Any-scale image synthesis offers an efficient and scalable way to synthesize photo-realistic images at any scale, even beyond 2K resolution. However, existing GAN-based solutions depend excessively on convolutions and a hierarchical architecture, which introduce inconsistency and the ``texture sticking'' issue when scaling the output resolution. From another perspective, implicit neural representation (INR)-based generators are scale-equivariant by design, but their huge memory footprint and slow inference hinder their adoption in large-scale or real-time systems. In this work, we propose $\textbf{C}$olumn-$\textbf{R}$ow $\textbf{E}$ntangled $\textbf{P}$ixel $\textbf{S}$ynthesis ($\textbf{CREPS}$), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To reduce the memory footprint and make the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate ``thick'' column and row encodings. Experiments on various datasets, including FFHQ, LSUN-Church, MetFaces, and Flickr-Scenery, confirm CREPS' ability to synthesize scale-consistent and alias-free images at arbitrary resolutions with practical training and inference speed. Code is available at https://github.com/VinAIResearch/CREPS.
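As a rough illustration of the bi-line idea (a minimal sketch, not the authors' implementation: the tensor shapes, variable names, and the element-wise fusion rule below are our assumptions), a dense $H \times W \times C$ feature map can be stored as a ``thick'' column encoding of shape $H \times C$ plus a row encoding of shape $W \times C$, and only materialized on demand:

```python
import numpy as np

# Hypothetical sketch of a bi-line feature decomposition (shapes and the
# fusion rule are illustrative assumptions, not the paper's exact design).
H, W, C = 256, 256, 64

rng = np.random.default_rng(0)
col = rng.standard_normal((H, C))  # "thick" column encoding: one C-dim vector per row index
row = rng.standard_normal((W, C))  # "thick" row encoding: one C-dim vector per column index

# Storage cost is O((H + W) * C) instead of O(H * W * C) for a dense map.
# The full H x W x C feature map is reconstructed only when needed, here
# via an element-wise (Hadamard) product of the broadcast encodings:
feat = col[:, None, :] * row[None, :, :]  # shape (H, W, C)
print(feat.shape)  # (256, 256, 64)
```

Under this kind of factorization, scaling the output resolution grows memory linearly in $H + W$ rather than quadratically in $H \times W$, which is what makes the representation attractive for synthesis beyond 2K.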