Image inpainting is an underdetermined inverse problem: it naturally admits diverse contents that fill the missing or corrupted regions reasonably and realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasing contents, but CNNs suffer from limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents through autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in transformers is suboptimal, because corrupted regions can have arbitrary shapes with contexts from arbitrary directions. We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT) that models deep bidirectional contexts for autoregressive generation of diverse inpainting contents. BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution contents without being constrained by the quadratic complexity of attention in transformers. Specifically, it first generates pluralistic low-resolution image structures by adapting transformers and then synthesizes realistic high-resolution texture details with a CNN-based up-sampling network. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting, both qualitatively and quantitatively.
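
To make the quadratic-complexity argument concrete, the following minimal sketch counts query-key pairs for full self-attention over a grid of tokens, assuming one token per pixel. The function name `attention_pairs` and the two resolutions (32×32 and 256×256) are illustrative assumptions, not values taken from the abstract.

```python
def attention_pairs(height: int, width: int) -> int:
    """Count query-key pairs for full self-attention over a height x width token grid."""
    tokens = height * width      # sequence length, one token per pixel (assumption)
    return tokens * tokens       # every token attends to every token: O(n^2)


# Hypothetical resolutions, chosen only for illustration.
low_res = attention_pairs(32, 32)      # transformer stage on coarse structure
high_res = attention_pairs(256, 256)   # attention applied directly at full size
ratio = high_res // low_res

print(low_res, high_res, ratio)
```

Under these assumed sizes, an 8× increase in side length multiplies the attention cost by 4096, which is consistent with the motivation for running the transformer at low resolution and leaving high-resolution detail to a CNN up-sampler.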