We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks.
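To make the mechanism concrete, below is a minimal PyTorch sketch of a hash-routed feedforward layer in the spirit described above: each token id is sent through a fixed random hash to one of several expert feedforward blocks, so no routing parameters, load-balancing loss, or assignment algorithm is involved. This is an illustrative assumption of how such a layer could look, not the authors' implementation; the class name, the modulo-of-a-random-permutation hash, and all sizes are hypothetical.

```python
import torch
import torch.nn as nn


class HashFeedForward(nn.Module):
    """Sketch of a hash-routed feedforward layer (illustrative, not the paper's code).

    Each token id is mapped by a fixed, roughly balanced random hash to one of
    `num_experts` expert FFNs; the hash is not trained, so there are no routing
    parameters and no load-balancing loss.
    """

    def __init__(self, vocab_size, d_model, d_ff, num_experts):
        super().__init__()
        self.num_experts = num_experts
        # Fixed random token-id -> expert-index hash, stored as a buffer (not a parameter).
        hash_table = torch.randperm(vocab_size) % num_experts
        self.register_buffer("hash_table", hash_table)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, hidden, token_ids):
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = self.hash_table[token_ids]  # which expert handles each token
        out = torch.zeros_like(hidden)
        for k in range(self.num_experts):
            mask = expert_ids == k               # tokens routed to expert k
            if mask.any():
                out[mask] = self.experts[k](hidden[mask])
        return out


if __name__ == "__main__":
    # Toy usage with made-up sizes.
    layer = HashFeedForward(vocab_size=1000, d_model=64, d_ff=256, num_experts=4)
    tokens = torch.randint(0, 1000, (2, 10))
    hidden = torch.randn(2, 10, 64)
    print(layer(hidden, tokens).shape)  # torch.Size([2, 10, 64])
```

In this sketch only the experts' weights are learned; the token-to-expert assignment is fixed at initialization, which is what makes the routing parameter-free.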