Antibodies are vital proteins that provide robust protection for the human body against pathogens. The development of both general protein and antibody-specific pre-trained language models has facilitated antibody prediction tasks. However, few studies have comprehensively explored the representation capability of distinct pre-trained language models on different antibody problems. To investigate this, we aim to answer the following key questions: (1) How do pre-trained language models perform on antibody tasks with different levels of specificity? (2) How much does a model benefit if we introduce specific biological mechanisms into the pre-training process? (3) Do the learned antibody pre-trained representations make sense in real-world antibody problems, such as drug discovery and understanding of immune processes? Previously, the lack of an available benchmark largely hindered the study of these questions. To facilitate this investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pre-trained language models through an empirical study, and present conclusions and new insights. Our ATUE and code are released at https://github.com/dqwang122/EATLM.