Image inpainting is the ill-posed problem of recovering missing or damaged image content from incomplete images with masks. Previous works usually predict auxiliary structures (e.g., edges, segmentation maps, and contours) to help fill in visually realistic patches in a multi-stage fashion. However, imprecise auxiliary priors may yield biased inpainting results. Besides, methods implemented as multiple stages of complex neural networks are time-consuming. To address these issues, we develop an end-to-end multi-modality guided transformer network, comprising one inpainting branch and two auxiliary branches for semantic segmentation and edge textures. Within each transformer block, the proposed multi-scale spatial-aware attention module can learn multi-modal structural features efficiently via auxiliary denormalization. Unlike previous methods that rely on direct guidance from biased priors, our method enriches semantically consistent context in an image based on discriminative interplay information from multiple modalities. Comprehensive experiments on several challenging image inpainting datasets show that our method achieves state-of-the-art performance and handles various regular and irregular masks efficiently.
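The "auxiliary denormalization" mentioned in the abstract can be illustrated with a minimal sketch in the spirit of spatially-adaptive normalization: the inpainting features are normalized, then modulated by per-pixel scale and shift maps derived from the auxiliary modalities (segmentation, edges). This is an assumption-laden illustration, not the paper's actual implementation; the `w_gamma`/`w_beta` projections below are random placeholders for learned layers.

```python
import numpy as np

def auxiliary_denorm(feat, aux, eps=1e-5):
    """Sketch of auxiliary denormalization (hypothetical, SPADE-style):
    normalize the inpainting features per channel, then modulate them
    with scale/shift maps computed from auxiliary modality maps.

    feat: (C, H, W) inpainting-branch features
    aux:  (A, H, W) stacked auxiliary maps (e.g. segmentation, edges)
    """
    # Per-channel normalization of the inpainting features.
    mu = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    norm = (feat - mu) / np.sqrt(var + eps)

    # Placeholder 1x1 projections standing in for learned conv layers
    # that would map auxiliary maps to modulation parameters.
    rng = np.random.default_rng(0)
    w_gamma = rng.standard_normal((feat.shape[0], aux.shape[0])) * 0.1
    w_beta = rng.standard_normal((feat.shape[0], aux.shape[0])) * 0.1
    gamma = np.tensordot(w_gamma, aux, axes=1)  # (C, H, W) scale map
    beta = np.tensordot(w_beta, aux, axes=1)    # (C, H, W) shift map

    # Spatially-varying modulation conditioned on the auxiliary priors.
    return norm * (1.0 + gamma) + beta
```

The key design point sketched here is that the auxiliary modalities influence the features through spatially-varying modulation parameters rather than being concatenated as direct inputs, which is one way to avoid propagating biased prior predictions directly into the inpainted content.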