The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers

Recently, many datasets have been proposed to test the systematic generalization ability of neural networks. The companion baseline Transformers, typically trained with default hyper-parameters from standard tasks, are shown to fail dramatically. Here we demonstrate that by revisiting model configurations as basic as scaling of embeddings, early stopping, relative positional embedding, and Universal Transformer variants, we can drastically improve the performance of Transformers on systematic generalization. We report improvements on five popular datasets: SCAN, CFQ, PCFG, COGS, and Mathematics dataset. Our models improve accuracy from 50% to 85% on the PCFG productivity split, and from 35% to 81% on COGS. On SCAN, relative positional embedding largely mitigates the EOS decision problem (Newman et al., 2020), yielding 100% accuracy on the length split with a cutoff at 26. Importantly, performance differences between these models are typically invisible on the IID data split. This calls for proper generalization validation sets for developing neural networks that generalize systematically. We publicly release the code to reproduce our results.
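The abstract names "scaling of embeddings" without describing it. The following is a minimal sketch of one common variant, assuming the token-embedding upscaling by sqrt(d_model) and the sinusoidal positional encodings of the original Transformer. It is not the authors' released code: the function names, the dimension, and the initialization scale are all hypothetical and chosen only for illustration.

```python
import math
import random


def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding for one position (d_model must be even)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def embed(token_vec, pos, d_model, scale_tokens):
    """Add a positional encoding to a token embedding.

    If scale_tokens is True, the token embedding is first multiplied by
    sqrt(d_model), which increases its weight relative to the position signal.
    """
    scale = math.sqrt(d_model) if scale_tokens else 1.0
    pos_vec = sinusoidal_position(pos, d_model)
    return [scale * t + p for t, p in zip(token_vec, pos_vec)]


if __name__ == "__main__":
    d_model = 64  # hypothetical model width
    rng = random.Random(0)
    # Assumed initialization: small Gaussian entries, so the vector norm is near 1.
    token = [rng.gauss(0.0, 1.0 / math.sqrt(d_model)) for _ in range(d_model)]
    pos_vec = sinusoidal_position(5, d_model)

    print("positional norm:", l2_norm(pos_vec))  # always sqrt(d_model / 2)
    print("token norm, unscaled:", l2_norm(token))
    print("token norm, scaled:", l2_norm([math.sqrt(d_model) * t for t in token]))
```

Under these assumptions the positional vector always has norm sqrt(d_model / 2), so whether the token embedding is upscaled determines which of the two signals dominates the sum. This is one plausible reason a configuration choice this basic can matter.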
