DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents - 专知论文

会员服务 ·

0

未标记 · 可辨认的 · 小样本学习 · Learning · 标注 ·

2023 年 5 月 3 日

DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents

翻译：暂无翻译

Furkan Simsek,Brian Pfitzmann,Hendrik Raetz,Jona Otholt,Haojin Yang,Christoph Meinel

from arxiv, 6 pages (including references and excluding appendix)

Language identification describes the task of recognizing the language of written text in documents. This information is crucial because it can be used to support the analysis of a document's vocabulary and context. Supervised learning methods in recent years have advanced the task of language identification. However, these methods usually require large labeled datasets, which often need to be included for various domains of images, such as documents or scene images. In this work, we propose DocLangID, a transfer learning approach to identify the language of unlabeled historical documents. We achieve this by first leveraging labeled data from a different but related domain of historical documents. Secondly, we implement a distance-based few-shot learning approach to adapt a convolutional neural network to new languages of the unlabeled dataset. By introducing small amounts of manually labeled examples from the set of unlabeled images, our feature extractor develops a better adaptability towards new and different data distributions of historical documents. We show that such a model can be effectively fine-tuned for the unlabeled set of images by only reusing the same few-shot examples. We showcase our work across 10 languages that mostly use the Latin script. Our experiments on historical documents demonstrate that our combined approach improves the language identification performance, achieving 74% recognition accuracy on the four unseen languages of the unlabeled dataset.

翻译：暂无翻译

0

相关内容

未标记

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

中低温固体氧化物燃料电池核-壳结构阴极的研究

国家自然科学基金

0+阅读 · 2014年12月31日

6.5-45.5GHz超宽带致冷接收机的关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

亚微米Beta-Li2TiO3的高稳定超胞结构：超临界水热场调制

国家自然科学基金

0+阅读 · 2013年12月31日

混凝土Weibull统计尺寸效应理论模型改进研究

国家自然科学基金

0+阅读 · 2013年12月31日

新型金属-有机骨架基Z型光催化产氢材料的合成及性能研究

国家自然科学基金

0+阅读 · 2013年12月31日

内生非晶复合材料在高速率动态冲击下的力学响应机理

国家自然科学基金

0+阅读 · 2012年12月31日

关于AI-半环簇与 Conway半环簇的研究

国家自然科学基金

1+阅读 · 2012年12月31日

笼的连通性研究

国家自然科学基金

0+阅读 · 2011年12月31日

重组酵母菌次生代谢产物中红色组分代谢与调控的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

等离子体改性活性炭纤维脱硫脱氮的研究

国家自然科学基金

0+阅读 · 2008年12月31日

Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain

Arxiv

0+阅读 · 2023年6月16日

The Split Matters: Flat Minima Methods for Improving the Performance of GNNs

Arxiv

0+阅读 · 2023年6月15日

Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery

Arxiv

0+阅读 · 2023年6月15日

Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information

Arxiv

0+阅读 · 2023年6月15日

Document Entity Retrieval with Massive and Noisy Pre-training

Arxiv

0+阅读 · 2023年6月15日

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Arxiv

0+阅读 · 2023年6月14日

A Survey of Deep Learning for Scientific Discovery

A Survey of Deep Learning for Scientific Discovery

Arxiv

29+阅读 · 2020年3月26日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

How to train your MAML

Arxiv

26+阅读 · 2019年3月5日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

VIP会员

文章信息

相关主题

小样本学习

相关VIP内容

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《代码、指挥与冲突：描绘军事人工智能的未来》报告

【斯坦福博士论文】面向地理空间数据的多模态与多尺度建模：时空生成式人工智能

美国启动“自有军事人工智能计划”：采用谷歌Gemini以推动全军人工智能应用

《创新与适应性作为军事成功的关键因素：来自俄乌战争的战略洞见》报告

相关资讯

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

深度自进化聚类：Deep Self-Evolution Clustering

深度自进化聚类：Deep Self-Evolution Clustering

我爱读PAMI

15+阅读 · 2019年4月13日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

相关论文

Listener Model for the PhotoBook Referential Game with CLIPScores as Implicit Reference Chain

Arxiv

0+阅读 · 2023年6月16日

The Split Matters: Flat Minima Methods for Improving the Performance of GNNs

Arxiv

0+阅读 · 2023年6月15日

Neuroevolution is a Competitive Alternative to Reinforcement Learning for Skill Discovery

Arxiv

0+阅读 · 2023年6月15日

Improving Speech Enhancement Performance by Leveraging Contextual Broad Phonetic Class Information

Arxiv

0+阅读 · 2023年6月15日

Document Entity Retrieval with Massive and Noisy Pre-training

Arxiv

0+阅读 · 2023年6月15日

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Arxiv

0+阅读 · 2023年6月14日

A Survey of Deep Learning for Scientific Discovery

A Survey of Deep Learning for Scientific Discovery

Arxiv

29+阅读 · 2020年3月26日

Differentiable Reasoning on Large Knowledge Bases and Natural Language

Arxiv

12+阅读 · 2019年12月17日

How to train your MAML

Arxiv

26+阅读 · 2019年3月5日

DeepSeek: Content Based Image Search & Retrieval

Arxiv

13+阅读 · 2018年1月11日

相关基金

中低温固体氧化物燃料电池核-壳结构阴极的研究

国家自然科学基金

0+阅读 · 2014年12月31日

6.5-45.5GHz超宽带致冷接收机的关键技术研究

国家自然科学基金

0+阅读 · 2014年12月31日

亚微米Beta-Li2TiO3的高稳定超胞结构：超临界水热场调制

国家自然科学基金

0+阅读 · 2013年12月31日

混凝土Weibull统计尺寸效应理论模型改进研究

国家自然科学基金

0+阅读 · 2013年12月31日

新型金属-有机骨架基Z型光催化产氢材料的合成及性能研究

国家自然科学基金

0+阅读 · 2013年12月31日

内生非晶复合材料在高速率动态冲击下的力学响应机理

国家自然科学基金

0+阅读 · 2012年12月31日

关于AI-半环簇与 Conway半环簇的研究

国家自然科学基金

1+阅读 · 2012年12月31日

笼的连通性研究

国家自然科学基金

0+阅读 · 2011年12月31日

重组酵母菌次生代谢产物中红色组分代谢与调控的分子机制

国家自然科学基金

0+阅读 · 2009年12月31日

等离子体改性活性炭纤维脱硫脱氮的研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员