We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.
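To make the idea concrete, below is a minimal sketch (not the paper's exact configuration) of how frozen, randomly initialized "reservoir" layers can be interspersed with trainable transformer layers. The module name, the `reservoir_every` placement rule, and all hyperparameters are illustrative assumptions; only the core mechanism, freezing some layers at their random initialization while training the rest, reflects the idea described above.

```python
import torch
import torch.nn as nn

class ReservoirTransformerEncoder(nn.Module):
    """Encoder stack in which some layers are "reservoirs": randomly
    initialized once and never updated during training."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, reservoir_every=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True
            )
            # Illustrative placement rule: every `reservoir_every`-th layer is frozen.
            if (i + 1) % reservoir_every == 0:
                for p in layer.parameters():
                    p.requires_grad = False  # keeps its random initialization forever
            self.layers.append(layer)

    def forward(self, x):
        # Reservoir layers still apply a fixed non-linear transformation in the
        # forward pass, and gradients still flow through them to earlier layers.
        for layer in self.layers:
            x = layer(x)
        return x

model = ReservoirTransformerEncoder()
# The optimizer only sees the non-reservoir parameters.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Because the reservoir layers are excluded from the optimizer and never receive weight updates, the potential savings in wall-clock training time come from skipping their parameter updates while retaining their fixed non-linear mixing of representations.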