Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, serving as the basis of high-performance materials, sensing, and much more. While an incredible range of functionality has been sampled in nature, this accounts for only a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties and apply them to the environmental and medical challenges facing humanity. This is the purpose of protein design. Sequence design is an important aspect of protein design, and many successful methods for it have been developed. Recently, deep-learning methods that frame sequence design as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby making the design method more accessible. Despite this trend, the tools for assessing and comparing such models remain quite generic. The goal of this paper is both to address the timely problem of evaluation and to shine a spotlight, within the Machine Learning community, on specific assessment criteria that will accelerate impact. We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep-learning-based methods. Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine whether they are likely to fold into the intended 3D shapes.
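To illustrate what framing sequence design as a classification problem can look like in practice, the following is a minimal sketch, not the benchmark's actual implementation. It assumes a model emits, for each position, a probability for each of the 20 canonical amino acids; the function names and the choice of metrics (overall accuracy and macro-averaged recall) are illustrative assumptions rather than tests defined in this work.

```python
from collections import Counter

# The 20 canonical residues in one-letter code; this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def argmax_sequence(probabilities):
    """Turn per-position class probabilities into a sequence.

    `probabilities` is a list of rows, one per position, each holding
    20 values ordered as in AMINO_ACIDS (hypothetical model output).
    """
    return "".join(
        AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)]
        for row in probabilities
    )


def _check(native, designed):
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")


def accuracy(native, designed):
    """Fraction of positions where the designed residue matches the native one."""
    _check(native, designed)
    return sum(n == d for n, d in zip(native, designed)) / len(native)


def macro_recall(native, designed):
    """Mean per-residue-type recall over the types present in the native sequence.

    Unlike plain accuracy, each residue type counts equally, so a model that
    only predicts the most common residues is penalised.
    """
    _check(native, designed)
    totals = Counter(native)
    hits = Counter(n for n, d in zip(native, designed) if n == d)
    return sum(hits[aa] / totals[aa] for aa in totals) / len(totals)
```

For example, comparing a native fragment `"AAAG"` with a design `"AAAA"` gives an accuracy of 0.75 but a macro-averaged recall of 0.5, because glycine is never recovered. This is one reason a generic accuracy score alone can hide biologically relevant failure modes.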