Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, constructing such domain-specific large-scale models is computationally expensive. Here, we propose Representation Reprogramming via Dictionary Learning (R2DL), an end-to-end representation learning framework that reprograms deep models pretrained on an alternate-domain task to perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences by learning a sparse linear mapping between the English vocabulary embeddings and the protein sequence vocabulary embeddings. Our model attains better accuracy and improves data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram an off-the-shelf pretrained English language transformer and benchmark it on a set of protein physicochemical property prediction tasks (secondary structure, stability, homology) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial activity, toxicity, antibody affinity).
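The sketch below illustrates the dictionary-learning step described above: expressing each protein token embedding as a sparse linear combination of frozen English token embeddings. It is not the paper's implementation; in R2DL the mapping is trained end-to-end against the downstream task loss, whereas here we assume hypothetical target protein embeddings and solve a standalone L1-penalized sparse coding problem with scikit-learn's SparseCoder, purely for illustration. All dimensions and variable names (V_E, V_P, theta) are assumptions, not taken from the paper.

```python
# Minimal sketch of a sparse linear mapping between vocabulary embeddings,
# in the spirit of R2DL's dictionary-learning objective. Assumption: we
# have target protein token embeddings V_P; R2DL instead learns the
# mapping end-to-end with the task loss.
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)

d = 768          # embedding dimension of the pretrained language model
n_english = 500  # size of the (subsampled) English token vocabulary
n_protein = 25   # protein vocabulary: 20 amino acids + special tokens

# Rows of V_E are the frozen English token embeddings (the dictionary);
# SparseCoder expects unit-norm dictionary atoms.
V_E = rng.standard_normal((n_english, d))
V_E /= np.linalg.norm(V_E, axis=1, keepdims=True)

# Hypothetical target embeddings for the protein tokens.
V_P = rng.standard_normal((n_protein, d))

# Solve for a sparse coefficient matrix theta so that V_P ~= theta @ V_E,
# with an L1 penalty forcing each protein token to use few English tokens.
coder = SparseCoder(dictionary=V_E,
                    transform_algorithm="lasso_lars",
                    transform_alpha=0.1)
theta = coder.transform(V_P)  # shape: (n_protein, n_english)

reconstruction = theta @ V_E
print("mean nonzeros per protein token:", (theta != 0).sum(axis=1).mean())
print("relative reconstruction error:",
      np.linalg.norm(V_P - reconstruction) / np.linalg.norm(V_P))
```

The design choice mirrored here is that the English dictionary stays frozen while only the sparse coefficients are learned, which is what makes reprogramming far cheaper than pretraining a domain-specific model from scratch.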