Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. This unidirectional, sequential attention bias is not only unnatural for images, since it disregards large parts of a scene until synthesis is almost complete; it also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: whereas a multistage diffusion process successively removes information to coarsen an image, we train a (short) Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments show greatly improved image modification capabilities over autoregressive models, while also providing high-fidelity image generation; both are enabled through efficient training in a compressed latent space. Specifically, our approach can take unrestricted, user-provided masks into account to perform local image editing. Thus, in contrast to pure autoregressive models, it can solve free-form image inpainting and, in the case of conditional models, local, text-guided image modification without requiring mask-specific training.
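To make the forward (coarsening) direction of the multinomial diffusion concrete, the sketch below shows one noising step on a grid of discrete latent codes: each token is resampled uniformly from the codebook with some probability, otherwise kept. This is a minimal illustration under assumed hyperparameters (codebook size, noise rate), not the paper's exact implementation; the reverse, information-restoring direction is what the autoregressive model is trained to perform stage by stage.

```python
import numpy as np

def multinomial_diffusion_step(tokens, beta, vocab_size, rng):
    """One forward (coarsening) diffusion step on discrete tokens.

    With probability `beta`, a token is resampled uniformly from the
    vocabulary; otherwise it is kept. Repeating this step removes
    information from the latent code, which a learned Markov chain
    then inverts coarse-to-fine. Illustrative sketch only.
    """
    mask = rng.random(tokens.shape) < beta           # which positions to corrupt
    noise = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(mask, noise, tokens)

# Toy example: a 4x4 grid of codebook indices (assumed codebook size 1024).
rng = np.random.default_rng(0)
x0 = rng.integers(0, 1024, size=(4, 4))
x1 = multinomial_diffusion_step(x0, beta=0.3, vocab_size=1024, rng=rng)
```

Chaining this step over several stages yields the coarse-to-fine hierarchy of contexts described above, with each stage of the model conditioning on the (coarser) output of the previous one.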