Scholars in the humanities rely heavily on ancient manuscripts to study the history, religion, and socio-political structures of the past. Many efforts have been devoted to digitizing these precious manuscripts with Optical Character Recognition (OCR) technology, but most manuscripts have been blemished over the centuries, so an OCR program cannot be expected to capture faded graphs or stains on the pages. This work presents a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that auto-corrects noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan e-text corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we incorporated a Confidence Score mechanism into the Transformer architecture to perform the spelling correction task. Measured by loss and Character Error Rate (CER), our Transformer + Confidence Score architecture proves superior to the vanilla Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed its erroneous tokens and visualized its Attention and Self-Attention heatmaps.
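Since the models above are compared by Character Error Rate (CER), the following is a minimal Python sketch of how CER is conventionally computed: the Levenshtein edit distance from the model's corrected output to the gold transcription, normalized by the reference length. The function names and the Tibetan example string are illustrative assumptions, not taken from the paper's code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / number of reference characters."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

# Hypothetical example: the OCR-ed form drops one vowel sign from the reference.
print(cer("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤས་བདེ་ལེགས"))  # 1 edit / 16 characters = 0.0625
```

Note that CER here operates on Unicode code points, so a missing Tibetan vowel sign counts as a single-character error even though it is visually part of a larger syllable.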