How and where proteins interface with one another can ultimately impact the proteins' functions along with a range of other biological processes. As such, precise computational methods for protein interface prediction (PIP) are highly sought after, as they could yield significant advances in drug discovery and design as well as protein function analysis. However, the traditional benchmark dataset for this task, Docking Benchmark 5 (DB5), contains only 230 complexes for training, validating, and testing machine learning algorithms. In this work, we expand on a dataset recently introduced for this task, the Database of Interacting Protein Structures (DIPS), to present DIPS-Plus, an enhanced, feature-rich dataset of 42,112 complexes for geometric deep learning of protein interfaces. The previous version of DIPS contains only the Cartesian coordinates and types of the atoms comprising a given protein complex, whereas DIPS-Plus adds a range of new residue-level features, including protrusion indices, half-sphere amino acid compositions, and new profile hidden Markov model (HMM)-based sequence features for each amino acid, giving researchers a large, well-curated feature bank for training protein interface prediction methods. We demonstrate through rigorous benchmarks that training an existing state-of-the-art (SOTA) model for PIP on DIPS-Plus yields SOTA results, surpassing the performance of all other models trained on residue-level and atom-level encodings of protein complexes to date.