Low-resource Handwritten Text Recognition (HTR) is a hard problem because annotated data is scarce and linguistic information (dictionaries and language models) is very limited. This is the case, for example, with historical ciphered manuscripts, which are usually written in invented alphabets to hide their content. In this paper, we address this problem with a data generation technique based on Bayesian Program Learning (BPL). Unlike traditional generation approaches, which require large amounts of annotated images, our method can generate human-like handwriting using only one sample of each symbol from the desired alphabet. After generating symbols, we create synthetic lines to train state-of-the-art HTR architectures in a segmentation-free fashion. Quantitative and qualitative analyses confirm the effectiveness of the proposed method, which achieves results competitive with those obtained using real annotated data.