Identifying small-molecule binding sites in target proteins, at either pocket or residue resolution, is critical in many virtual and real drug-discovery scenarios. Because such binding sites are not always easy to find using domain knowledge or traditional methods, a number of deep learning methods that predict binding sites from protein structures have been developed in recent years. Here we present a new deep learning algorithm of this kind that significantly outperformed all state-of-the-art baselines at both resolutions, pocket and residue. This performance was also demonstrated in a case study involving human serum albumin and its binding sites. Our algorithm introduces new ideas in both the model architecture and the training method. For the model architecture, it incorporates SE(3)-invariant geometric self-attention layers that operate on top of residue-level CNN outputs. This residue-level processing enables transfer learning between the two resolutions, which turned out to significantly improve binding pocket prediction. Moreover, we developed a novel augmentation method based on protein homology, which prevented our model from overfitting. Overall, we believe that our contribution to the literature is twofold. First, we provide a new computational method for binding site prediction that is relevant to real-world applications, as shown by its good performance on different benchmarks and in the case study. Second, the novel ideas in our method (the model architecture, transfer learning and homology augmentation) could serve as useful components in future work.
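To make the SE(3)-invariance property concrete, the following minimal Python sketch shows one common way a self-attention layer over residues can be made invariant to rotations and translations: the attention logits use the residue coordinates only through pairwise distances. This is an illustration under stated assumptions, not the architecture described here. The function names, the linear distance bias, and the `length_scale` parameter are all hypothetical, and the per-residue feature vectors are random stand-ins for CNN outputs that are assumed to be invariant themselves.

```python
import math
import random


def pairwise_distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_biased_attention(features, coords, length_scale=8.0):
    """Single-head self-attention over residues (hypothetical layer).

    Coordinates enter only through pairwise distances, so the output does
    not change when the whole structure is rotated or translated.
    """
    n, d = len(features), len(features[0])
    output = []
    for i in range(n):
        logits = []
        for j in range(n):
            # Content term: scaled dot product of the two feature vectors.
            content = sum(p * q for p, q in zip(features[i], features[j]))
            content /= math.sqrt(d)
            # Geometric term: penalise distant residue pairs (assumed form).
            bias = pairwise_distance(coords[i], coords[j]) / length_scale
            logits.append(content - bias)
        # Numerically stable softmax over residue j.
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        output.append([
            sum(w * features[j][k] for j, w in enumerate(weights)) / total
            for k in range(d)
        ])
    return output


def rigid_motion(point, angle_z, angle_x, shift):
    """Rotate a point about the z axis, then the x axis, then translate it."""
    x, y, z = point
    x, y = (x * math.cos(angle_z) - y * math.sin(angle_z),
            x * math.sin(angle_z) + y * math.cos(angle_z))
    y, z = (y * math.cos(angle_x) - z * math.sin(angle_x),
            y * math.sin(angle_x) + z * math.cos(angle_x))
    return [x + shift[0], y + shift[1], z + shift[2]]


rng = random.Random(0)
# Stand-ins for per-residue CNN outputs and residue coordinates.
features = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
coords = [[rng.gauss(0, 10) for _ in range(3)] for _ in range(5)]
moved = [rigid_motion(p, 0.7, -1.3, [5.0, -2.0, 11.0]) for p in coords]

out_a = distance_biased_attention(features, coords)
out_b = distance_biased_attention(features, moved)
max_diff = max(abs(u - v) for ra, rb in zip(out_a, out_b)
               for u, v in zip(ra, rb))
print("outputs match after rigid motion:", max_diff < 1e-9)
```

Restricting the geometric input to distances is the simplest route to invariance. Layers that also use orientation information typically express it in local residue frames, which this sketch does not attempt.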
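Title: Enhancing the Protein Binding Site Prediction of Convolutional Neural Networks Using SE(3)-Invariant Transformers, Transfer Learning and Homology-Based Augmentation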