An Information Extraction Study: Take In Mind the Tokenization!

Token-free · INFORMS · Information extraction · Extraction · Tokenizer

April 1, 2023

Christos Theodoropoulos, Marie-Francine Moens

From arXiv. Submitted manuscript/preprint (accepted at EUSFLAT 2023, to be published in Lecture Notes in Computer Science (LNCS)).

Current research on the advantages and trade-offs of using characters, instead of tokenized text, as input for deep learning models, has evolved substantially. New token-free models remove the traditional tokenization step; however, their efficiency remains unclear. Moreover, the effect of tokenization is relatively unexplored in sequence tagging tasks. To this end, we investigate the impact of tokenization when extracting information from documents and present a comparative study and analysis of subword-based and character-based models. Specifically, we study Information Extraction (IE) from biomedical texts. The main outcome is twofold: tokenization patterns can introduce inductive bias that results in state-of-the-art performance, and the character-based models produce promising results; thus, transitioning to token-free IE models is feasible.
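As a rough illustration of the two input granularities the abstract compares, the sketch below segments the same words into characters and into subwords, then projects word-level BIO tags onto the resulting units for sequence tagging. It is not the authors' code or models: the toy vocabulary, the greedy longest-match segmentation (in the style of WordPiece), the `Protein` label, and all function names are assumptions made for this example.

```python
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary; a real subword model learns its vocabulary from data.
TOY_VOCAB = {"amino", "##transfer", "##ase", "level", "##s", "rise", "[UNK]"}


def char_tokenize(word: str) -> List[str]:
    """Character-based input: one unit per character, no vocabulary needed."""
    return list(word)


def subword_tokenize(word: str) -> List[str]:
    """Greedy longest-match-first segmentation against TOY_VOCAB."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in TOY_VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no segmentation found for this word
        pieces.append(piece)
        start = end
    return pieces


def expand_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Project word-level BIO labels onto the units produced by `tokenize`."""
    units: List[str] = []
    unit_labels: List[str] = []
    for word, label in zip(words, labels):
        parts = tokenize(word)
        units.extend(parts)
        # The first unit keeps the word label; later units of an entity get I-.
        continuation = "O" if label == "O" else "I-" + label[2:]
        unit_labels.extend([label] + [continuation] * (len(parts) - 1))
    return units, unit_labels


if __name__ == "__main__":
    words = ["aminotransferase", "levels", "rise"]
    labels = ["B-Protein", "O", "O"]  # hypothetical annotation
    print(expand_labels(words, labels, subword_tokenize))
    print(expand_labels(words, labels, char_tokenize))
```

In this toy setting the subword segmentation exposes pieces such as `##ase` as single units, which is one way a tokenization pattern could act as an inductive bias, while the character-level view yields a longer sequence with no vocabulary dependence.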
