Information Extraction from Visually Rich Documents with Font Style Embeddings - 专知论文

Information extraction · INFORMS · Vision · LayoutLM · Tokenizer

August 12, 2022

Information Extraction from Visually Rich Documents with Font Style Embeddings

Ismail Oussaid, William Vanhuffel, Pirashanth Ratnamogan, Mhamed Hajaiej, Alexis Mathey, Thomas Gilles

Information extraction (IE) from documents is an intensive area of research with a large set of industrial applications. Current state-of-the-art methods focus on scanned documents, with approaches combining computer vision, natural language processing and layout representation. We propose to challenge the usage of computer vision in the case where both token style and visual representation are available (i.e., native PDF documents). Our experiments on three real-world complex datasets demonstrate that using an embedding based on token style attributes instead of a raw visual embedding in the LayoutLM model is beneficial. Depending on the dataset, such an embedding yields an improvement of 0.18% to 2.29% in the weighted F1-score with a decrease of 30.7% in the final number of trainable parameters of the model, leading to an improvement in both efficiency and effectiveness.
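The abstract does not say which style attributes are used or how they are combined with the other inputs, so the following is only a minimal sketch of the general idea. It assumes four attributes (font name, font size, bold, italic), one lookup table per attribute, and a simple sum that takes the place of a visual embedding next to the text and layout embeddings. `TokenStyle`, `StyleEmbedding` and `combine` are hypothetical names, not the authors' code or the LayoutLM API, and the randomly initialised tables stand in for parameters that would be learned.

```python
import bisect
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TokenStyle:
    """Style attributes of one token, as read from a native PDF (assumed set)."""
    font_name: str
    font_size: float
    bold: bool
    italic: bool


class StyleEmbedding:
    """Hypothetical style embedding: one lookup table per attribute, summed."""

    def __init__(self, font_names: Sequence[str], size_edges: Sequence[float],
                 dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def table(rows: int) -> List[List[float]]:
            # Stand-in for a trainable embedding matrix of shape (rows, dim).
            return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

        self.dim = dim
        # Index 0 is reserved for fonts not seen when the table was built.
        self.font_index = {name: i + 1 for i, name in enumerate(font_names)}
        self.size_edges = sorted(size_edges)
        self.font_table = table(len(self.font_index) + 1)
        self.size_table = table(len(self.size_edges) + 1)
        self.bold_table = table(2)
        self.italic_table = table(2)

    def size_bucket(self, size: float) -> int:
        # Continuous font sizes are discretised into buckets before lookup.
        return bisect.bisect_right(self.size_edges, size)

    def __call__(self, style: TokenStyle) -> List[float]:
        rows = [
            self.font_table[self.font_index.get(style.font_name, 0)],
            self.size_table[self.size_bucket(style.font_size)],
            self.bold_table[int(style.bold)],
            self.italic_table[int(style.italic)],
        ]
        # Element-wise sum of the four attribute vectors.
        return [sum(values) for values in zip(*rows)]


def combine(text_emb: Sequence[float], layout_emb: Sequence[float],
            style_emb: Sequence[float]) -> List[float]:
    """Sum text, layout and style vectors into one token representation."""
    return [t + l + s for t, l, s in zip(text_emb, layout_emb, style_emb)]


embedder = StyleEmbedding(["Helvetica", "Times"], size_edges=[8, 12, 16], dim=4)
heading = TokenStyle("Helvetica", 18.0, bold=True, italic=False)
token_vector = combine([0.0] * 4, [0.0] * 4, embedder(heading))
```

Under these assumptions the style branch consists only of a few small lookup tables, which is in line with the abstract's point that dropping the visual branch reduces the number of trainable parameters.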

Information Extraction

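Information extraction (IE) structures the information contained in text, turning it into a table-like form. The input to an IE system is raw text; the output is a set of information items in a fixed format. These items are extracted from a wide variety of documents and then integrated in a uniform form, which is the main task of information extraction. The benefit of a uniform form is that the information is easy to check and compare. IE does not attempt to understand an entire document; it analyzes only the parts of the document that contain relevant information. Which information counts as relevant is determined by the domain scope fixed when the system is designed.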
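As a toy illustration of "raw text in, fixed-format items out", the sketch below pulls three fields from an invoice-like sentence with regular expressions. The field names, the `InvoiceRecord` type and the patterns are assumptions chosen for the example; real IE systems, including the LayoutLM-based models discussed on this page, learn to label tokens rather than rely on hand-written patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceRecord:
    """Fixed-format output: the same fields for every input document."""
    invoice_number: Optional[str]
    date: Optional[str]
    total: Optional[float]


# Hypothetical patterns for one narrow domain (English invoices, ISO dates).
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE),
}


def extract(text: str) -> InvoiceRecord:
    """Analyze only the spans that match a field; ignore the rest of the text."""

    def first(name: str) -> Optional[str]:
        match = PATTERNS[name].search(text)
        return match.group(1) if match else None

    total = first("total")
    return InvoiceRecord(
        invoice_number=first("invoice_number"),
        date=first("date"),
        total=float(total) if total is not None else None,
    )


record = extract(
    "Thank you for your order. Invoice No: INV-2022-081 issued 2022-08-12. "
    "Total: $149.50 due in 30 days."
)
```

Because every document is mapped to the same record type, records from different sources can be checked and compared field by field; fields that are not found are left as `None` rather than guessed.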
