Recent developments in Natural Language Processing (NLP) demonstrate that large-scale, self-supervised pre-training can be extremely beneficial for downstream tasks. These ideas have been adapted to other domains, including the analysis of the amino acid sequences of proteins. However, to date most work on protein sequences relies on direct masked language model (MLM)-style pre-training. In this work, we design a new adversarial pre-training method for proteins, extending and specializing similar advances in NLP. We show compelling results in comparison to traditional MLM pre-training, though further development is needed to ensure the gains are worth the significant computational cost.
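To make the contrast with MLM pre-training concrete, the sketch below shows one common adversarial pre-training scheme from NLP (ELECTRA-style replaced-token detection) applied to amino acid tokens: a small generator fills in masked residues, and a discriminator learns to spot which residues were replaced. This is a minimal illustration under assumed names and sizes (`TinyEncoder`, `adversarial_step`, model dimensions, loss weighting), not the paper's actual architecture or objective.

```python
# Hypothetical sketch of ELECTRA-style adversarial pre-training on amino acid
# sequences. All module names and hyperparameters are illustrative assumptions,
# not the method described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # 20 standard residues
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)                        # extra [MASK] token
VOCAB_SIZE = len(VOCAB) + 1

class TinyEncoder(nn.Module):
    """Stand-in transformer encoder; a real protein model would be far larger."""
    def __init__(self, vocab_size, d_model=64, out_dim=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, out_dim or vocab_size)

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids)))

generator = TinyEncoder(VOCAB_SIZE)                 # small MLM proposing residues
discriminator = TinyEncoder(VOCAB_SIZE, out_dim=1)  # predicts replaced vs. original

def adversarial_step(ids, mask_prob=0.15):
    # 1) Mask a random subset of residues, exactly as in ordinary MLM pre-training.
    mask = torch.rand(ids.shape) < mask_prob
    corrupted = ids.masked_fill(mask, MASK_ID)

    # 2) The generator fills masked positions; its loss is the standard MLM loss.
    logits = generator(corrupted)
    gen_loss = F.cross_entropy(logits[mask], ids[mask])

    # 3) Sample replacement residues (no gradient through sampling) and train the
    #    discriminator to detect which positions were replaced -- the per-token
    #    adversarial objective that distinguishes this setup from plain MLM.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=logits).sample()
    filled = torch.where(mask, sampled, ids)
    replaced = (filled != ids).float()
    disc_logits = discriminator(filled).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, replaced)

    # ELECTRA weights the discriminator loss heavily; 50.0 is its published value.
    return gen_loss + 50.0 * disc_loss

# Toy usage: a batch of two length-12 "protein" sequences of random residues.
batch = torch.randint(0, len(VOCAB), (2, 12))
loss = adversarial_step(batch)
loss.backward()
```

One design point worth noting: because every position (not just the ~15% that are masked) supplies a training signal to the discriminator, replaced-token detection is more sample-efficient than MLM, but training two models simultaneously is part of the computational cost the abstract flags.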