This paper presents an approach to deciphering large collections of handwritten index cards from historical dictionaries. Our study provides a working solution that reads the cards and links their lemmas to a searchable list of dictionary entries for a large historical dictionary, the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model; (2) a recognition model to decipher the handwritten content, designed as a spatial transformer network (STN) followed by a convolutional neural network (RCNN) with a connectionist temporal classification (CTC) layer, trained on a synthetic set of 500,000 generated Polish words of different lengths; and (3) a post-processing step using constrained Word Beam Search (WBC), in which the predictions are matched against a list of dictionary entries known in advance. Our model achieves a word-level accuracy of 0.881, which outperforms the base RCNN model. Within this study we also produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and for transfer learning in HTR applications.
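To make steps (2) and (3) more concrete, the following minimal Python sketch shows how a CTC output can be decoded and then constrained to a list of known dictionary entries. It is an illustration under stated assumptions, not the system described above: it uses greedy best-path CTC decoding and a nearest-entry lookup by edit distance as a simple stand-in for constrained Word Beam Search. The function names, the toy alphabet, and the three-word lexicon are all hypothetical.

```python
from typing import Sequence

BLANK = 0  # assumed convention: label 0 is the CTC blank symbol


def ctc_greedy_decode(frame_labels: Sequence[int], alphabet: str) -> str:
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    chars = []
    previous = BLANK
    for label in frame_labels:
        if label != previous and label != BLANK:
            chars.append(alphabet[label - 1])  # label k maps to alphabet[k - 1]
        previous = label
    return "".join(chars)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def snap_to_lexicon(prediction: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon entry closest to the raw prediction.

    A simplified stand-in for constrained Word Beam Search: the real
    algorithm constrains the beam during decoding, not afterwards.
    """
    return min(lexicon, key=lambda entry: edit_distance(prediction, entry))


# Toy data (hypothetical): per-frame argmax labels from a recognizer.
alphabet = "kotńsz"
lexicon = ["kot", "koń", "kosz"]
raw = ctc_greedy_decode([1, 1, 0, 5, 6, 6], alphabet)
print(raw, "->", snap_to_lexicon(raw, lexicon))  # ksz -> kosz
```

Matching after decoding is the simplest way to exploit a lexicon known in advance; a search that applies the constraint during decoding can recover words that the greedy path has already lost.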