PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction

Sirui Liu, Jun Zhang, Haotian Chu, Min Wang, Boxin Xue, Ningxi Ni, Jialiang Yu, Yuhao Xie, Zhenyu Chen, Mengyun Chen, Yuan Liu, Piya Patra, Fan Xu, Jie Chen, Zidong Wang, Lijiang Yang, Fan Yu, Lei Chen, Yi Qin Gao

Proteins are essential components of human life, and their structures are important for analyzing function and mechanism. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, existing open-source datasets fall far short of the needs of modern protein sequence-structure research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. The dataset consists of 570k true-structure sequences (10 TB) and 745k complementary distillation sequences (15 TB). In addition, we provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of the dataset for training by participating in the CAMEO contest, in which our model won first place. We hope that the PSP dataset, together with the training benchmark, will enable a broader community of AI and biology researchers to pursue AI-driven protein research.


