Predicting the structure of an antibody from its sequence is important because it enables a better design process for synthetic antibodies, which play a vital role in the health industry. Most of an antibody's structure is conserved. The most variable and hardest-to-predict part is the third complementarity-determining region of the antibody heavy chain (CDR H3). Recently, deep learning has been applied to CDR H3 structure prediction. However, current state-of-the-art methods are not end-to-end: they output inter-residue distances and orientations, which are passed to the RosettaAntibody package, and it combines this additional information with statistical and physics-based methods to predict the 3D structure. This pipeline does not allow fast screening and therefore inhibits the development of targeted synthetic antibodies. In this work, we present an end-to-end model for predicting the CDR H3 loop structure that performs on par with state-of-the-art methods in terms of accuracy but is an order of magnitude faster. We also raise an issue with a commonly used RosettaAntibody benchmark that leads to data leaks, i.e., the presence of identical sequences in the train and test datasets.
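The data leak mentioned above, identical sequences appearing in both the train and test sets, can be checked with a simple set intersection. The following is a minimal sketch, not the authors' procedure: the function names `find_leaked_sequences` and `drop_leaked_from_train` are hypothetical, the example sequences are made up for illustration, and it assumes sequences are plain one-letter amino-acid strings compared for exact identity after trimming whitespace and upper-casing.

```python
def _normalize(seq):
    """Trim whitespace and upper-case so trivially different spellings match."""
    return seq.strip().upper()


def find_leaked_sequences(train_seqs, test_seqs):
    """Return the sorted sequences that occur in both the train and test splits."""
    train = {_normalize(s) for s in train_seqs}
    test = {_normalize(s) for s in test_seqs}
    return sorted(train & test)


def drop_leaked_from_train(train_seqs, test_seqs):
    """Return the normalized train sequences with every test sequence removed."""
    test = {_normalize(s) for s in test_seqs}
    return [_normalize(s) for s in train_seqs if _normalize(s) not in test]


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    train = ["ARDYYGSSYFDY", "ARGGW", "akdl "]
    test = ["ARGGW", "AKDL", "ARTTT"]
    print(find_leaked_sequences(train, test))
    print(drop_leaked_from_train(train, test))
```

Exact matching only detects identical sequences, which is the kind of leak described here; near-duplicates would need a similarity threshold, which is outside the scope of this sketch.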