While vision transformers have been highly successful in improving performance on image-based tasks, little work has been reported on applying transformers to multilingual scene text recognition, owing to the complexity of the visual appearance of multilingual text. To fill this gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images and a supplementary transformer with adaptive n-grams embeddings, which aims to flexibly explore the potential correlations between neighbouring visual patches; capturing these correlations is essential for feature extraction from multilingual scene text. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese text collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER considerably outperforms the state of the art, especially in handling complex multilingual scene text.
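As an illustration of the distinction between single patch embeddings and n-grams embeddings, the following minimal sketch builds both token types from a toy text-line image. It is not the TANGER implementation: the helper names `patchify` and `ngram_tokens`, the patch size, the embedding dimension, and the random linear projections are all assumptions made for illustration, and the adaptive selection of n described in the abstract is not modelled here (n is fixed at 2).

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) image into flattened non-overlapping patches, row-major."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # [r, i, c, j] -> [r, c, i, j], so each block is one patch.
    blocks = image.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch)


def ngram_tokens(patches, n):
    """Concatenate every run of n consecutive patches into one token (hypothetical helper)."""
    count = patches.shape[0] - n + 1
    return np.stack([patches[i:i + n].reshape(-1) for i in range(count)])


rng = np.random.default_rng(0)
image = rng.random((8, 32))            # toy 8x32 grayscale text-line crop
patches = patchify(image, patch=8)     # 4 patches, 64 values each

dim = 16                               # assumed embedding width
w_single = rng.normal(size=(64, dim))  # projection for single patches
w_bigram = rng.normal(size=(128, dim)) # projection for 2-gram tokens

single_emb = patches @ w_single                   # input to a primary branch
bigram_emb = ngram_tokens(patches, 2) @ w_bigram  # input to a supplementary branch
```

Under these assumptions, each 2-gram token spans two neighbouring patches, so a character or stroke that straddles a patch boundary is seen by a single token rather than being split across two.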