Many recent inpainting works have achieved impressive results by leveraging Deep Neural Networks (DNNs) to model various prior information for image restoration. Unfortunately, the performance of these methods is largely limited by the representation ability of vanilla Convolutional Neural Network (CNN) backbones. On the other hand, Vision Transformers (ViTs) with self-supervised pre-training have shown great potential for many visual recognition and object detection tasks. A natural question is whether the inpainting task can benefit substantially from a ViT backbone. However, it is nontrivial to directly substitute such a new backbone into an inpainting network, as inpainting is an inverse problem fundamentally different from recognition tasks. To this end, this paper incorporates the pre-trained Masked AutoEncoder (MAE) into the inpainting model, which thus enjoys richer informative priors to enhance the inpainting process. Moreover, we propose to use attention priors from the MAE to help the inpainting model learn more long-distance dependencies between masked and unmasked regions. Extensive ablations of both the inpainting model and the self-supervised pre-training model are discussed in this paper. Besides, experiments on both Places2 and FFHQ demonstrate the effectiveness of our proposed model. Codes and pre-trained models are released at https://github.com/ewrfcas/MAE-FAR.
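As a minimal sketch of the core idea (not the authors' released implementation; see the linked repository for that), the following PyTorch snippet shows one plausible way to hand a frozen MAE's token features and averaged self-attention maps to a CNN inpainting branch as priors. The module names (ToyMAEEncoder, AttentionPriorFusion) and all dimensions here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class ToyMAEEncoder(nn.Module):
    """Stand-in for a frozen, pre-trained MAE encoder (one block for brevity)."""

    def __init__(self, dim=192, num_heads=3, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @torch.no_grad()  # the priors come from a frozen model
    def forward(self, img):
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2)  # (B, N, C)
        x = self.norm(tokens)
        # need_weights=True also returns the (head-averaged) attention map,
        # which we reuse downstream as a long-range dependency prior.
        feat, attn = self.attn(x, x, x, need_weights=True)  # attn: (B, N, N)
        return feat, attn


class AttentionPriorFusion(nn.Module):
    """Hypothetical fusion: aggregate CNN features with MAE attention weights."""

    def __init__(self, cnn_dim=64, mae_dim=192):
        super().__init__()
        self.proj = nn.Linear(mae_dim, cnn_dim)
        self.out = nn.Conv2d(cnn_dim * 2, cnn_dim, kernel_size=1)

    def forward(self, cnn_feat, mae_feat, attn):
        B, C, H, W = cnn_feat.shape                    # H * W must equal N
        tokens = cnn_feat.flatten(2).transpose(1, 2)   # (B, N, C)
        # Long-range aggregation: mix features between distant positions
        # (e.g. masked and unmasked regions) per the MAE attention prior.
        gathered = torch.bmm(attn, tokens)             # (B, N, C)
        prior = self.proj(mae_feat)                    # (B, N, C)
        fused = torch.cat([gathered, prior], dim=-1)   # (B, N, 2C)
        fused = fused.transpose(1, 2).reshape(B, 2 * C, H, W)
        return cnn_feat + self.out(fused)              # residual injection


if __name__ == "__main__":
    img = torch.randn(2, 3, 64, 64)            # masked input image
    mae = ToyMAEEncoder(patch=16).eval()
    fusion = AttentionPriorFusion()
    mae_feat, attn = mae(img)                  # N = (64 / 16) ** 2 = 16 tokens
    cnn_feat = torch.randn(2, 64, 4, 4)        # CNN features at token resolution
    out = fusion(cnn_feat, mae_feat, attn)
    print(out.shape)                           # torch.Size([2, 64, 4, 4])
```

The design choice sketched here is that the MAE stays frozen and contributes only features and attention maps, so the inpainting CNN can borrow long-distance correspondences it would struggle to learn with local convolutions alone.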