Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR), which jointly learns a vision model and a language model, is poorly extensible to low-resource document collections, because training the joint vision-language model requires extensive labeled sequences and compute. This study instead models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model learns only characters' visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, the model opens new avenues for community engagement in making digital history more representative of documentary history.
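To make the retrieval formulation concrete, the following is a minimal sketch, not the paper's implementation: a toy vision encoder maps character crops to normalized embeddings, an InfoNCE-style contrastive loss trains it on paired views of the same characters, and recognition is nearest-neighbor lookup against a reference gallery of character embeddings. The encoder architecture, loss, and all names here are illustrative assumptions.

```python
# Sketch of OCR as character-level image retrieval with a contrastively
# trained vision encoder. All components are illustrative placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CharEncoder(nn.Module):
    """Toy convolutional encoder standing in for a contrastively trained model."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: paired views of the same character are positives."""
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


@torch.no_grad()
def recognize(encoder: CharEncoder,
              char_crops: torch.Tensor,     # (N, 1, H, W) crops from a document
              gallery_crops: torch.Tensor,  # (K, 1, H, W) reference character renders
              gallery_labels: list[str]) -> list[str]:
    """Assign each crop the label of its nearest gallery embedding."""
    queries = encoder(char_crops)           # (N, D)
    gallery = encoder(gallery_crops)        # (K, D)
    sims = queries @ gallery.T              # cosine similarities
    nearest = sims.argmax(dim=1)
    return [gallery_labels[i] for i in nearest.tolist()]


if __name__ == "__main__":
    enc = CharEncoder()
    crops = torch.rand(5, 1, 32, 32)         # placeholder document crops
    gallery = torch.rand(3, 1, 32, 32)       # placeholder reference characters
    print(recognize(enc, crops, gallery, ["a", "b", "c"]))
```

Because recognition reduces to comparing visual embeddings, extending coverage to a new script or font in this sketch only requires adding labeled reference crops to the gallery, which is why a retrieval formulation can be more sample-efficient than jointly training a vision-language sequence model.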