ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models


Sequence design · Residues · Protein engineering · Protein sequence design · Language models

March 29, 2023


Youhan Lee, Hasun Yu

From arXiv (preprint)

Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have been a promising tool for protein sequence design. In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while the other residues are kept fixed. Unfortunately, because pLMs generate from left to right, existing pLMs can only modify suffix residues when prompted with prefix residues, which is insufficient for an infilling task that must consider the whole surrounding context. To find more effective pLMs for protein engineering, we design a new benchmark, Secondary structurE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on the benchmark, we reveal the weakness of existing language models and show that language models trained via the fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also show, through extensive experiments and visualizations, that ProtFIM generates protein sequences with decent protein representations.
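The fill-in-middle transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common prefix-suffix-middle layout used for fill-in-middle training of causal language models. The sentinel strings `<PRE>`, `<SUF>`, `<MID>`, `<EOS>` and the function names are hypothetical, since the abstract does not state the tokens or span-sampling scheme ProtFIM uses.

```python
import random

# Hypothetical sentinel tokens (assumption): the abstract does not name
# the special tokens that ProtFIM actually uses.
PRE, SUF, MID, EOS = "<PRE>", "<SUF>", "<MID>", "<EOS>"


def fim_transform(sequence, start, end):
    """Move the middle span [start, end) to the end of the training string.

    A left-to-right model trained on this layout sees both the prefix and
    the suffix before it has to predict the middle residues.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span must satisfy 0 <= start <= end <= len(sequence)")
    prefix = sequence[:start]
    middle = sequence[start:end]
    suffix = sequence[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOS}"


def random_fim_transform(sequence, rng):
    """Apply fim_transform with a uniformly sampled span (an assumed scheme)."""
    start, end = sorted(rng.randint(0, len(sequence)) for _ in range(2))
    return fim_transform(sequence, start, end)


def infilling_prompt(prefix, suffix):
    """Build an inference prompt; the model would generate the middle after MID."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


if __name__ == "__main__":
    toy = "MKTAYIAKQR"  # toy amino-acid string, not a real protein
    print(fim_transform(toy, 3, 7))
    print(random_fim_transform(toy, random.Random(0)))
    print(infilling_prompt("MKT", "KQR"))
```

With this layout, an ordinary causal model conditions on both flanking regions when it generates the middle span, which is the property the abstract says plain left-to-right prompting lacks.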

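The abstract names SEIFER as a recovery benchmark but does not spell out its scoring rule. Purely as an illustration, one plausible reading is a per-position recovery rate: the fraction of positions in the infilled segment whose label matches the reference. The function name `recovery_rate` and the use of per-residue label strings (for example, secondary-structure codes such as `H`, `E`, `C`) are assumptions, not the paper's definition.

```python
def recovery_rate(reference, candidate):
    """Fraction of aligned positions where candidate matches reference.

    Both arguments are equal-length strings of per-residue labels, such as
    amino acids or secondary-structure codes. This is an assumed metric,
    not necessarily the one SEIFER uses.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    if not reference:
        raise ValueError("segments must be non-empty")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


if __name__ == "__main__":
    # Toy labels: one of six positions differs.
    print(recovery_rate("HHHHEE", "HHHCEE"))
```

Averaging such a score over many masked segments would give one number per model, which is one way a benchmark of this kind could compare models.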

