Despite being self-supervised, protein language models have shown remarkable performance on fundamental biological tasks such as predicting the impact of genetic variation on protein structure and function. The effectiveness of these models on a diverse set of tasks suggests that they learn meaningful representations of the fitness landscape that can be useful for downstream clinical applications. Here, we interrogate the use of these language models in characterizing known pathogenic mutations in curated, medically actionable genes through an exhaustive search of putative compensatory mutations on each variant's genetic background. Systematic analysis of the predicted effects of these compensatory mutations reveals unappreciated structural features of proteins that are missed by other structure predictors such as AlphaFold. While deep mutational scan experiments provide an unbiased estimate of the mutational landscape, we encourage the community to generate and curate rescue mutation experiments to inform the design of more sophisticated co-masking strategies and to leverage large language models more effectively for downstream clinical prediction tasks.