Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction


OCR · Machine Learning · Learning · Reducible · Extensibility

October 31, 2022


Pit Schneider, Yves Maurer

From arXiv. Journal of Data Mining and Digital Humanities; minor revision.

Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. Such decisions are crucial in order to guarantee low computational overhead and a reduced risk of quality degradation, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology for text block level quality assessment. By extending this technique, it also presents a regression model that is able to take into account the enhancement potential of a new OCR engine. Both mark promising approaches, especially for cultural institutions dealing with historical data of lower quality.
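The abstract does not describe the features or the model used by the library, so the following is only a minimal sketch of the general idea: estimate, per text block, how much a new OCR engine might improve quality, and reprocess only the blocks where the predicted gain is large enough. Everything here is an assumption made for illustration: the `Block` type, the single feature (the block's current quality estimate), the one-variable least-squares fit, the `min_gain` threshold, and all numbers, which are invented toy values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Block:
    block_id: str
    quality: float  # hypothetical quality estimate of the existing OCR, in [0, 1]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def select_candidates(blocks, intercept, slope, min_gain):
    """Return ids of blocks whose predicted gain reaches min_gain, best first."""
    scored = [(intercept + slope * b.quality, b.block_id) for b in blocks]
    keep = [(gain, bid) for gain, bid in scored if gain >= min_gain]
    return [bid for gain, bid in sorted(keep, reverse=True)]


# Toy training sample (invented values): current quality of a block versus
# the quality gain measured after rerunning it with the new engine.
train_quality = [0.2, 0.4, 0.6, 0.8]
train_gain = [0.5, 0.35, 0.2, 0.05]
intercept, slope = fit_line(train_quality, train_gain)

# Blocks that have not been reprocessed yet (also invented).
blocks = [Block("b1", 0.3), Block("b2", 0.9), Block("b3", 0.5)]
selected = select_candidates(blocks, intercept, slope, min_gain=0.2)
print(selected)
```

On this toy data the two lower-quality blocks, `b1` and `b3`, are selected, while `b2` is skipped because its predicted gain is slightly negative. Skipping such blocks is what keeps computational overhead low and avoids degrading text that is already good.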

