Pre-trained protein language models have demonstrated significant applicability across different protein engineering tasks. A common use of the latent representations from these pre-trained transformer models is to mean-pool across residue positions, reducing the feature dimensionality for downstream tasks such as predicting biophysical properties or other functional behaviours. In this paper we provide a two-fold contribution to machine learning (ML) driven drug design. Firstly, we demonstrate the power of sparsity-promoting penalization of pre-trained transformer representations, yielding more robust and accurate melting temperature (Tm) predictions for single-chain variable fragments, with a mean absolute error of 0.23 °C. Secondly, we demonstrate the value of framing the prediction problem probabilistically; specifically, we advocate adopting probabilistic frameworks, especially in the context of ML driven drug design.
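To make the pipeline described above concrete, the sketch below shows one way to mean-pool per-residue embeddings from a pre-trained protein language model and fit a sparsity-penalized (L1) linear head for Tm regression. This is an illustrative sketch only, not the authors' implementation: the embedding dimension `EMB_DIM`, penalty weight `L1_WEIGHT`, and the `SparseTmRegressor` class and `training_step` helper are assumed placeholders.

```python
import torch
import torch.nn as nn

# Illustrative sketch (assumed, not the paper's code): mean-pool per-residue
# embeddings from a pre-trained protein language model, then fit a
# sparsity-penalized linear head for melting-temperature (Tm) regression.

EMB_DIM = 1280      # assumed per-residue embedding size of the pre-trained model
L1_WEIGHT = 1e-3    # assumed strength of the sparsity-promoting penalty


class SparseTmRegressor(nn.Module):
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.head = nn.Linear(emb_dim, 1)  # linear map from pooled features to Tm

    def forward(self, residue_embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (batch, seq_len, emb_dim); mask: (batch, seq_len) with 1 for valid residues.
        # Mean-pool only over valid residue positions to obtain a fixed-size feature vector.
        mask = mask.unsqueeze(-1).float()
        pooled = (residue_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled).squeeze(-1)

    def l1_penalty(self) -> torch.Tensor:
        # Sparsity-promoting penalty on the regression weights.
        return self.head.weight.abs().sum()


def training_step(model, embeddings, mask, tm_targets, optimizer):
    # One optimization step: MAE regression loss plus the L1 sparsity penalty.
    optimizer.zero_grad()
    pred = model(embeddings, mask)
    loss = nn.functional.l1_loss(pred, tm_targets) + L1_WEIGHT * model.l1_penalty()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The L1 term drives many head weights toward zero, which is one standard way to realize the sparsity-promoting penalization referred to in the abstract; the probabilistic framing mentioned as the second contribution would replace the point estimate with a predictive distribution and is not shown here.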