Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts


May 12, 2021


Tomasz Stanisławek, Filip Graliński, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek

From arXiv; accepted to ICDAR 2021.

Key Information Extraction (KIE) is becoming increasingly important in natural language processing, yet only a few well-defined problems serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets, Kleister NDA and Kleister Charity. They comprise a mix of scanned and born-digital long, formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by using both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 non-disclosure agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% and 83.57% on the Kleister NDA and Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
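To give a feel for how long these documents are, the counts quoted in the abstract can be turned into rough per-document averages. The sketch below is only illustrative arithmetic: the `DatasetStats` class and its field names are hypothetical and not part of any Kleister tooling, and dividing the "unique pages" figure by the document count is an assumption about how to read those totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetStats:
    """Counts copied from the abstract; the class itself is a hypothetical helper."""
    name: str
    documents: int
    pages: int
    entities: int

    @property
    def pages_per_document(self) -> float:
        # Average number of pages per document.
        return self.pages / self.documents

    @property
    def entities_per_document(self) -> float:
        # Average number of entities to extract per document.
        return self.entities / self.documents


KLEISTER = [
    DatasetStats("Kleister Charity", documents=2788, pages=61643, entities=21612),
    DatasetStats("Kleister NDA", documents=540, pages=3229, entities=2160),
]

for stats in KLEISTER:
    print(
        f"{stats.name}: {stats.pages_per_document:.1f} pages/doc, "
        f"{stats.entities_per_document:.1f} entities/doc"
    )
```

Under these assumptions, a Kleister Charity report averages roughly 22 pages and 8 entities, while a Kleister NDA agreement averages roughly 6 pages and exactly 4 entities.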

