Document image rectification has received considerable attention, but how to learn effective representations of distorted document images remains under-explored. In this paper, we present DocMAE, a novel self-supervised framework for document image rectification. Our motivation is to use a masked autoencoder to encode the structural cues in document images that benefit rectification, i.e., the document boundaries and text lines. Specifically, we first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With this self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents by restoring document boundaries and missing text lines. Extensive experiments on the downstream rectification task show that the learned representation transfers well, validating the effectiveness of our method.
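The mask-then-reconstruct step described in the abstract can be illustrated with a minimal sketch. It is not the DocMAE implementation: the helper names (`random_patch_mask`, `apply_mask`, `masked_mse`), the 75% mask ratio, the zero fill value, and the choice to score only the masked patches are assumptions borrowed from common masked-autoencoder practice, and the toy "image" is a plain nested list rather than a real background-excluded document.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(num_patches * mask_ratio)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [
        [(r * grid_w + c) in masked_ids for c in range(grid_w)]
        for r in range(grid_h)
    ]


def apply_mask(image, mask, patch):
    """Zero out every pixel that falls inside a masked patch."""
    return [
        [0.0 if mask[y // patch][x // patch] else image[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def masked_mse(pred, target, mask, patch):
    """Mean squared error over pixels in masked patches only."""
    total, count = 0.0, 0
    for y in range(len(target)):
        for x in range(len(target[0])):
            if mask[y // patch][x // patch]:
                total += (pred[y][x] - target[y][x]) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 8x8 checkerboard image, split into a 4x4 grid of 2x2 patches.
PATCH = 2  # hypothetical patch size for illustration
image = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
mask = random_patch_mask(4, 4, mask_ratio=0.75, seed=0)  # ratio is an assumption
masked_image = apply_mask(image, mask, PATCH)

# A real model would predict the hidden pixels from masked_image; here the
# masked input itself stands in as a trivial "prediction" to exercise the loss.
loss = masked_mse(masked_image, image, mask, PATCH)
```

In an actual pipeline the encoder would see only the visible patches and a decoder would predict the hidden ones; the loss above only shows where the reconstruction signal comes from.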