How and where proteins interface with one another can ultimately impact the proteins' functions along with a range of other biological processes. As such, precise computational methods for protein interface prediction (PIP) are highly sought after, as they could yield significant advances in drug discovery and design as well as protein function analysis. However, the traditional benchmark dataset for this task, Docking Benchmark 5 (DB5), contains only 230 complexes for training, validating, and testing different machine learning algorithms. In this work, we expand on a dataset recently introduced for this task, the Database of Interacting Protein Structures (DIPS), to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces. The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex, whereas DIPS-Plus includes a range of new residue-level features, including protrusion indices, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid, giving researchers a large, well-curated feature bank for training protein interface prediction methods.