In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences from the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models on next-amino-acid prediction, zero-shot fitness prediction, and enzyme function prediction, and show benefits from increased scale. We release the RITA models openly, to the benefit of the research community.