Document images can suffer from many kinds of degradation, which hinder recognition and downstream processing. In this age of digitization, denoising them is important for proper use. To address this challenge, we present a new encoder-decoder architecture based on vision transformers that enhances both machine-printed and handwritten document images in an end-to-end fashion. The encoder operates directly on pixel patches and their positional information, without any convolutional layers, while the decoder reconstructs a clean image from the encoded patches. Experiments show that the proposed model outperforms state-of-the-art methods on several DIBCO benchmarks. Code and models will be publicly available at: \url{https://github.com/dali92002/DocEnTR}.
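To make the patch-based view concrete, the following is a minimal sketch of how an image can be split into flattened pixel patches with their grid positions, and how a clean image can be reassembled from patches. It is not the released DocEnTR code: the function names `patchify` and `unpatchify`, the list-of-rows image representation, and the patch size are assumptions chosen for illustration, and the transformer layers that would sit between the two steps are omitted.

```python
def patchify(image, patch):
    """Split a 2-D grayscale image (list of rows) into flattened square patches.

    Returns (patches, positions), where positions[i] is the (row, col)
    grid index of patches[i]. Hypothetical helper, for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches, positions = [], []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch in row-major order.
            flat = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(flat)
            positions.append((top // patch, left // patch))
    return patches, positions


def unpatchify(patches, positions, patch, height, width):
    """Inverse of patchify: write each flattened patch back at its grid position."""
    image = [[0] * width for _ in range(height)]
    for flat, (grid_row, grid_col) in zip(patches, positions):
        for i, value in enumerate(flat):
            r, c = divmod(i, patch)
            image[grid_row * patch + r][grid_col * patch + c] = value
    return image


# Usage: a 4x4 toy "image" split into four 2x2 patches and reassembled.
toy = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
toy_patches, toy_positions = patchify(toy, 2)
restored = unpatchify(toy_patches, toy_positions, 2, 4, 4)
```

In a full model, each flattened patch would be linearly projected and combined with an embedding of its position before entering the encoder, and the decoder would predict the pixel values of each clean patch; here the round trip simply returns the input unchanged.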