Structured information extraction from complex scientific text with fine-tuned large language models - 专知 Papers

INFORMS · Language modeling · SimPLe · Unstructured · Information extraction

December 10, 2022

Structured information extraction from complex scientific text with fine-tuned large language models

Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or in a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
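As a rough illustration of the prompt/completion pairing described in the abstract, the sketch below builds one training pair whose completion is a list of JSON objects, then parses such a completion back into records. Everything specific here is an assumption made for illustration: the separator and stop strings, the `host`/`dopants` field names, the helper names, and the example sentence are hypothetical and are not taken from the paper.

```python
import json

# Hypothetical markers; the abstract does not say which separators were used.
PROMPT_END = "\n\n###\n\n"
STOP = "END"


def make_training_record(passage, records):
    """Build one prompt/completion pair with the completion as a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records, ensure_ascii=False) + "\n\n" + STOP,
    }


def parse_completion(text):
    """Recover a list of JSON objects from a model completion.

    Returns an empty list when the text is not valid JSON, because a
    generative model is not guaranteed to emit well-formed output.
    """
    body = text.strip()
    if body.endswith(STOP):
        body = body[: -len(STOP)].strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else [parsed]


# Illustrative sentence and annotation written for this sketch (assumed schema).
passage = "The Mn-doped ZnO films were grown by sputtering."
records = [{"host": "ZnO", "dopants": ["Mn"]}]

pair = make_training_record(passage, records)
line = json.dumps(pair)  # one line of a JSONL training file
roundtrip = parse_completion(pair["completion"])
print(line)
print(roundtrip)
```

Writing one such line per annotated passage would give a JSONL file of the kind commonly used for fine-tuning. Returning an empty list on malformed output is only one possible policy; a real pipeline might instead log or retry those cases.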
