This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) system for Arabic historical documents and examining how different modeling procedures interact with the problem. The first phase studied the effect of Transformers on our custom-built Arabic dataset. One downside of that phase was the size of the training data: a mere 15,000 images out of our 30 million, due to a lack of resources. In this phase, we add an image enhancement layer, time and space optimizations, and a post-correction layer to help the model predict the correct word for the correct context. Notably, we propose an end-to-end text recognition approach using a Vision Transformer, namely BEiT, as the encoder and a vanilla Transformer as the decoder, eliminating CNNs for feature extraction and reducing the model's complexity. The experiments show that our end-to-end model outperforms convolutional backbones, attaining a CER of 4.46%.
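Since the abstract names the architecture concretely, the following is a minimal sketch of how a BEiT-encoder/vanilla-Transformer-decoder pairing could look in PyTorch. It is an illustration under assumptions, not the paper's released code: the timm checkpoint `beit_base_patch16_224`, the vocabulary size, and all dimensions are placeholders.

```python
# Sketch only: a BEiT image encoder (via timm) feeding a vanilla PyTorch
# Transformer decoder, as in the end-to-end approach described above.
# Checkpoint name, vocab size, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import timm

class BeitOcr(nn.Module):
    def __init__(self, vocab_size=8000, d_model=768, nhead=8,
                 num_layers=6, max_len=128):
        super().__init__()
        # Pretrained BEiT backbone; forward_features returns the sequence
        # of patch tokens, so no CNN feature extractor is needed.
        self.encoder = timm.create_model("beit_base_patch16_224",
                                         pretrained=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, target_ids):
        # images: (B, 3, 224, 224); target_ids: (B, T) character indices.
        memory = self.encoder.forward_features(images)   # (B, tokens, 768)
        tgt = self.embed(target_ids) + self.pos[:, :target_ids.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(
            target_ids.size(1)).to(images.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)   # causal decoding
        return self.lm_head(out)                         # (B, T, vocab)
```

Training such a model would minimize cross-entropy between the logits and the next-character targets; at inference time, decoding proceeds autoregressively from a start token.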