The pressing need to digitize historical document collections has led to strong interest in designing computerised image processing methods for automatic handwritten text recognition (HTR). Handwritten text exhibits high variability due to different writing styles, languages and scripts. Because sufficient amounts of annotated multi-writer text are unavailable, training an accurate and robust HTR system calls for data-efficient approaches. A case study on the ongoing project "Marginalia and Machine Learning" is presented here; it focuses on automatic detection and recognition of handwritten marginalia, i.e., text written in the margins or handwritten notes. A Faster R-CNN network is used for marginalia detection, and AttentionHTR is used for word recognition. The data comes from early printed book collections held in the Uppsala University Library that contain handwritten marginalia. Source code and pretrained models are available at https://github.com/ektavats/Project-Marginalia.