In this paper, we propose a Text-Degradation Invariant Auto-Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks: text recognition (handwritten or scene text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the use of labeled data. Each pretext objective is specifically tailored to the final downstream tasks. We conduct several ablation experiments that confirm the design choice of the selected pretext tasks. Importantly, the proposed model does not exhibit the limitations of previous state-of-the-art methods based on contrastive losses, while at the same time requiring substantially fewer data samples to converge. Finally, we demonstrate that our method surpasses the state of the art in existing supervised and self-supervised settings for handwritten and scene text recognition and document image enhancement. Our code and trained models will be made publicly available at~\url{http://Upon_Acceptance}.