Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text


Protein-protein interaction · BERT · Interaction · Model evaluation · Corpus

March 30, 2023


Hasin Rehana, Nur Bengisu Çam, Mert Basmaci, Yongqun He, Arzucan Özgür, Junguk Hur

Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as the generative pre-trained transformer (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the PPI identification performance of various GPT and BERT models using a manually curated benchmark corpus of 164 PPIs in 77 sentences from the Learning Language in Logic (LLL) dataset. BERT-based models achieved the best overall performance, with PubMedBERT achieving the highest precision (85.17%) and F1-score (86.47%) and BioM-ALBERT achieving the highest recall (93.83%). Despite not being explicitly trained for biomedical texts, GPT-4 achieved comparable performance to the best BERT models with 83.34% precision, 76.57% recall, and 79.18% F1-score. These findings suggest that GPT models can effectively detect PPIs from text data and have the potential for use in biomedical literature mining tasks.
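The precision, recall, and F1-score reported above are standard set-overlap metrics. The following is a minimal sketch of how they might be computed for one set of predicted interaction pairs against a gold annotation. It is not the authors' evaluation script: the function name `precision_recall_f1`, the protein names, and the choice to treat each interaction as an unordered pair are all assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def as_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Normalize interactions to unordered pairs (an assumption of this sketch)."""
    return {frozenset(p) for p in pairs}


def precision_recall_f1(
    gold: Iterable[Tuple[str, str]],
    predicted: Iterable[Tuple[str, str]],
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 of predicted pairs against gold pairs."""
    gold_set = as_pairs(gold)
    pred_set = as_pairs(predicted)
    true_positives = len(gold_set & pred_set)

    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical protein names, used only to exercise the function.
    gold = [("A", "B"), ("C", "D"), ("E", "F")]
    predicted = [("B", "A"), ("C", "D"), ("G", "H"), ("E", "X")]
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    # prints: precision=0.50 recall=0.67 f1=0.57
```

In this toy case two of the four predicted pairs match the gold set, so precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.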


