Large-scale pretraining instills large amounts of knowledge in deep neural networks. This, in turn, improves the generalization behavior of these models in downstream tasks. What exactly are the limits of the generalization benefits of large-scale pretraining? Here, we report observations from simple experiments aimed at addressing this question in the context of two semantic parsing tasks involving natural language, SCAN and COGS. We show that language models pretrained exclusively on non-English corpora, or even on programming language corpora, significantly improve out-of-distribution generalization on these benchmarks compared with models trained from scratch, even though both benchmarks are English-based. This demonstrates the surprisingly broad transferability of pretrained representations and knowledge. Pretraining with a large-scale protein sequence prediction task, on the other hand, mostly degrades generalization performance on SCAN and COGS, suggesting that pretrained representations do not transfer universally and that there are constraints on how dissimilar the pretraining and downstream domains can be for transfer to succeed. Finally, we show that larger models are harder to train from scratch and reach lower generalization accuracy when trained to convergence on the relatively small SCAN and COGS datasets, but the benefits of large-scale pretraining become much clearer with larger models.
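To make the comparison concrete, the setup can be pictured as fine-tuning a pretrained seq2seq language model on (input, output) pairs from a benchmark like SCAN and contrasting it with the same architecture trained from scratch. The sketch below is only illustrative and not the paper's exact configuration: the checkpoint name "t5-small", the hyperparameters, and the toy data pairs are assumptions standing in for the actual models and datasets used.

```python
# Hedged sketch, not the authors' exact setup: compares fine-tuning a
# pretrained seq2seq checkpoint against training the same architecture
# from random initialization on toy SCAN-style command/action pairs.
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "t5-small"  # placeholder checkpoint; assumption, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


def build_model(pretrained: bool):
    """Return either the pretrained checkpoint or a randomly initialized
    model with the same architecture (the 'from scratch' baseline)."""
    if pretrained:
        return AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    config = AutoConfig.from_pretrained(MODEL_NAME)
    return AutoModelForSeq2SeqLM.from_config(config)


def finetune(model, pairs, epochs=3, lr=1e-4, device="cpu"):
    """Minimal fine-tuning loop over (input, target) string pairs,
    e.g. SCAN commands and their action sequences."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for src, tgt in pairs:
            enc = tokenizer(src, return_tensors="pt").to(device)
            labels = tokenizer(tgt, return_tensors="pt").input_ids.to(device)
            loss = model(**enc, labels=labels).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model


# Toy pairs just to make the sketch runnable end to end (hypothetical data).
toy_pairs = [("jump twice", "JUMP JUMP"), ("walk left", "LTURN WALK")]
pretrained_model = finetune(build_model(pretrained=True), toy_pairs)
scratch_model = finetune(build_model(pretrained=False), toy_pairs)
```

In this framing, out-of-distribution generalization would be measured by evaluating both models on held-out compositional splits of the benchmark rather than on the training distribution.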