The rapid scaling of language models is motivating research into low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size. One-bit activations incur varying degrees of quality drop, but these are mitigated by the proposed architectural changes. We further conduct a scaling-law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Our implementation in JAX/Flax will be open-sourced.
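As a rough illustration of the kind of binarized layer described above, the following is a minimal JAX/Flax sketch of one-bit weight binarization with a per-tensor scaling factor, a straight-through estimator for gradients, and an extra LayerNorm on the input to keep the binarized dot-product variance in check. The names (`binarize_ste`, `BinaryDense`) and the specific design choices are assumptions for illustration only, not the paper's released implementation.

```python
# Hypothetical sketch of a one-bit-weight dense layer in JAX/Flax.
import jax
import jax.numpy as jnp
import flax.linen as nn


def binarize_ste(w: jnp.ndarray) -> jnp.ndarray:
    """Binarize weights to {-alpha, +alpha}; gradients pass straight through."""
    alpha = jnp.mean(jnp.abs(w))          # per-tensor scaling factor (assumption)
    w_bin = alpha * jnp.sign(w)
    # Straight-through estimator: forward pass uses w_bin, backward pass
    # treats the binarization as the identity.
    return w + jax.lax.stop_gradient(w_bin - w)


class BinaryDense(nn.Module):
    """Dense layer with one-bit weights (illustrative, not the paper's code)."""
    features: int

    @nn.compact
    def __call__(self, x):
        w = self.param("kernel", nn.initializers.lecun_normal(),
                       (x.shape[-1], self.features))
        # An extra LayerNorm on the input, in the spirit of the additional
        # LayerNorms the abstract describes, to control dot-product variance.
        x = nn.LayerNorm()(x)
        return x @ binarize_ste(w)


# Example usage:
#   model = BinaryDense(features=128)
#   params = model.init(jax.random.PRNGKey(0), jnp.ones((1, 64)))
#   y = model.apply(params, jnp.ones((1, 64)))
```

The straight-through estimator is the standard way to train through a non-differentiable sign function; only the full-precision shadow weights receive gradient updates, and they can be discarded (keeping the one-bit weights plus the scale) at inference time.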