As opposed to scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between a language model's size and the richness of its learned representations has been validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that optimally interprets the language of life. We present Ankh, the first general-purpose PLM trained on Google's TPU-v4, surpassing state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks on which Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.