HMBERT: 名称实体识别历史多语言语言模式 (hmBERT: Historical Multilingual Language Models for Named Entity Recognition) - 专知论文

会员服务 ·

0

语言模型化 · 命名实体识别 · entity · MoDELS · 可辨认的 ·

2022 年 5 月 31 日

hmBERT: Historical Multilingual Language Models for Named Entity Recognition

翻译：HMBERT: 名称实体识别历史多语言语言模式

Stefan Schweter,Luisa März,Katharina Schmid,Erion Çano

from arxiv, Submitted HIPE-2022 Working Note Paper for CLEF 2022 (Conference and Labs of the Evaluation Forum (CLEF 2022))

Compared to standard Named Entity Recognition (NER), identifying persons, locations, and organizations in historical texts forms a big challenge. To obtain machine-readable corpora, the historical text is usually scanned and optical character recognition (OCR) needs to be performed. As a result, the historical corpora contain errors. Also, entities like location or organization can change over time, which poses another challenge. Overall historical texts come with several peculiarities that differ greatly from modern texts and large labeled corpora for training a neural tagger are hardly available for this domain. In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models. We circumvent the need for labeled data by using unlabeled data for pretraining a language model. hmBERT, a historical multilingual BERT-based language model is proposed, with different sizes of it being publicly released. Furthermore, we evaluate the capability of hmBERT by solving downstream NER as part of this year's HIPE-2022 shared task and provide detailed analysis and insights. For the Multilingual Classical Commentary coarse-grained NER challenge, our tagger HISTeria outperforms the other teams' models for two out of three languages.

翻译：与标准命名实体识别(NER)相比,在历史文本中识别个人、地点和组织是一个巨大的挑战。要获得机器可读体体,历史文本通常是扫描的,需要进行光学字符识别(OCR)。因此,历史文本包含错误。此外,像地点或组织这样的实体可能随时间而变化,这又是一个挑战。总体历史文本具有若干不同之处,与现代文本有很大不同,对于该领域来说,很难找到用于培训神经塔格的大型标签公司。在这项工作中,我们通过培训大型历史语言模型,解决历史德文、英文、法文、瑞典文和芬兰文的NER。我们通过使用未贴标签的数据对语言模型进行预培训而避免使用标签数据的必要性。 HMBERT提出了一种基于历史多语言的BERT语言模型,其规模不同,公开发布。此外,我们通过解决下游NER技术作为本年度HIPE-2022共同任务的一部分来评估HBERT的能力,并提供详细的分析和洞察力。对于多种语言的Collearrical comvical coregy regresuted ex for other sours for other other other strupleglegleglegresuteaker of other sultragleglegleglemental

0

相关内容

语言模型化

语言模型化

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

【ACL2020】命名实体识别即依存解析，Named Entity Recognition as Dependency Parsing

【ACL2020】命名实体识别即依存解析，Named Entity Recognition as Dependency Parsing

专知会员服务

61+阅读 · 2020年5月15日

【论文翻译】2020最新预训练语言模型综述：Pre-trained Models for Natural Language Processing: A Survey

【论文翻译】2020最新预训练语言模型综述：Pre-trained Models for Natural Language Processing: A Survey

专知会员服务

94+阅读 · 2020年4月13日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

征稿 | CFP：Special Issue of NLP and KG(JCR Q2，IF2.67)

征稿 | CFP：Special Issue of NLP and KG(JCR Q2，IF2.67)

开放知识图谱

1+阅读 · 2022年4月4日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

基于景观格局演变的泾河上游土壤水空间格局形成机制与尺度效应

国家自然科学基金

0+阅读 · 2014年12月31日

基于多模式集合和贝叶斯模型平均的土壤湿度模拟及不确定性研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于FrameNet的中文评价词汇本体构建与观点挖掘研究

国家自然科学基金

1+阅读 · 2013年12月31日

OsDCL3b基因调控稻穗生长发育的遗传机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于牛XY精子差异表达基因的性控DNA疫苗研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于土壤冻融信息获取新方法的复杂农田冻融模型检验与参数解译

国家自然科学基金

0+阅读 · 2011年12月31日

基于卵巢组织差异表达基因的多浪羊常年发情的分子遗传基础

国家自然科学基金

0+阅读 · 2009年12月31日

基于实例动态泛化的共指消解

国家自然科学基金

0+阅读 · 2009年12月31日

基于多变量耦合的双螺杆捏合机型线设计方法研究

国家自然科学基金

0+阅读 · 2009年12月31日

鄱阳湖流域碳水循环对植被恢复的响应

国家自然科学基金

0+阅读 · 2008年12月31日

Rethinking the Value of Gazetteer in Chinese Named Entity Recognition

Arxiv

0+阅读 · 2022年7月18日

Win-Win Cooperation: Bundling Sequence and Span Models for Named Entity Recognition

Arxiv

0+阅读 · 2022年7月16日

Pre-Trained Models: Past, Present and Future

Arxiv

19+阅读 · 2021年6月15日

Pre-trained Models for Natural Language Processing: A Survey

Arxiv

113+阅读 · 2020年3月18日

CAN-NER: Convolutional Attention Network forChinese Named Entity Recognition

Arxiv

16+阅读 · 2019年4月3日

A Survey on Deep Learning for Named Entity Recognition

A Survey on Deep Learning for Named Entity Recognition

Arxiv

73+阅读 · 2018年12月22日

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Arxiv

12+阅读 · 2018年6月8日

Label-aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition

Arxiv

10+阅读 · 2018年4月28日

Incorporating Dictionaries into Deep Neural Networks for the Chinese Clinical Named Entity Recognition

Arxiv

12+阅读 · 2018年4月13日

Deep Active Learning for Named Entity Recognition

Arxiv

15+阅读 · 2018年2月4日

VIP会员

文章信息

相关主题

语言模型化

命名实体识别

相关VIP内容

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

【ACL2020】命名实体识别即依存解析，Named Entity Recognition as Dependency Parsing

【ACL2020】命名实体识别即依存解析，Named Entity Recognition as Dependency Parsing

专知会员服务

61+阅读 · 2020年5月15日

【论文翻译】2020最新预训练语言模型综述：Pre-trained Models for Natural Language Processing: A Survey

【论文翻译】2020最新预训练语言模型综述：Pre-trained Models for Natural Language Processing: A Survey

专知会员服务

94+阅读 · 2020年4月13日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

从社会学实验到行为仿真：理解基于Agent的观点动力学建模思维

中英文版《GPT-5 System Card速览》报告

ACL 2025 | 大模型结构化知识提示的泛化能力研究

【普林斯顿博士论文】大型模型的高效推理

相关资讯

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

征稿 | CFP：Special Issue of NLP and KG(JCR Q2，IF2.67)

征稿 | CFP：Special Issue of NLP and KG(JCR Q2，IF2.67)

开放知识图谱

1+阅读 · 2022年4月4日

IEEE ICKG 2022: Call for Papers

IEEE ICKG 2022: Call for Papers

机器学习与推荐算法

3+阅读 · 2022年3月30日

ACM MM 2022 Call for Papers

ACM MM 2022 Call for Papers

CCF多媒体专委会

5+阅读 · 2022年3月29日

IEEE TII Call For Papers

IEEE TII Call For Papers

CCF多媒体专委会

3+阅读 · 2022年3月24日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium8

中国图象图形学学会CSIG

0+阅读 · 2021年11月16日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

相关论文

Rethinking the Value of Gazetteer in Chinese Named Entity Recognition

Arxiv

0+阅读 · 2022年7月18日

Win-Win Cooperation: Bundling Sequence and Span Models for Named Entity Recognition

Arxiv

0+阅读 · 2022年7月16日

Pre-Trained Models: Past, Present and Future

Arxiv

19+阅读 · 2021年6月15日

Pre-trained Models for Natural Language Processing: A Survey

Arxiv

113+阅读 · 2020年3月18日

CAN-NER: Convolutional Attention Network forChinese Named Entity Recognition

Arxiv

16+阅读 · 2019年4月3日

A Survey on Deep Learning for Named Entity Recognition

A Survey on Deep Learning for Named Entity Recognition

Arxiv

73+阅读 · 2018年12月22日

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Arxiv

12+阅读 · 2018年6月8日

Label-aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition

Arxiv

10+阅读 · 2018年4月28日

Incorporating Dictionaries into Deep Neural Networks for the Chinese Clinical Named Entity Recognition

Arxiv

12+阅读 · 2018年4月13日

Deep Active Learning for Named Entity Recognition

Arxiv

15+阅读 · 2018年2月4日

相关基金

基于景观格局演变的泾河上游土壤水空间格局形成机制与尺度效应

国家自然科学基金

0+阅读 · 2014年12月31日

基于多模式集合和贝叶斯模型平均的土壤湿度模拟及不确定性研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于FrameNet的中文评价词汇本体构建与观点挖掘研究

国家自然科学基金

1+阅读 · 2013年12月31日

OsDCL3b基因调控稻穗生长发育的遗传机理研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于牛XY精子差异表达基因的性控DNA疫苗研究

国家自然科学基金

0+阅读 · 2012年12月31日

基于土壤冻融信息获取新方法的复杂农田冻融模型检验与参数解译

国家自然科学基金

0+阅读 · 2011年12月31日

基于卵巢组织差异表达基因的多浪羊常年发情的分子遗传基础

国家自然科学基金

0+阅读 · 2009年12月31日

基于实例动态泛化的共指消解

国家自然科学基金

0+阅读 · 2009年12月31日

基于多变量耦合的双螺杆捏合机型线设计方法研究

国家自然科学基金

0+阅读 · 2009年12月31日

鄱阳湖流域碳水循环对植被恢复的响应

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员