Modern pre-trained transformers have rapidly advanced the state-of-the-art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues, but doing so is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers to reach much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches enable, for the first time, fully binarized transformer models at a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline on the GLUE language understanding benchmark. Code and models are available at: https://github.com/facebookresearch/bit.
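As a rough illustration of the "elastic binary activation function with learned parameters" mentioned above, the minimal sketch below assumes a learnable scale `alpha` and shift `beta` together with a straight-through estimator so that gradients flow through the rounding step. The exact formulation used in the paper and repo may differ; the parameter names and the clipping range here are illustrative assumptions, not the released implementation.

```python
import torch


class ElasticBinaryActivation(torch.nn.Module):
    """Sketch of a binary activation with a learned scale and threshold.

    Assumption: activations are mapped to {0, 1} relative to learnable
    alpha/beta, then rescaled; gradients use a straight-through estimator.
    """

    def __init__(self):
        super().__init__()
        # Hypothetical learnable parameters (names are illustrative).
        self.alpha = torch.nn.Parameter(torch.tensor(1.0))
        self.beta = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the learned shift and scale, then clip to [0, 1].
        z = ((x - self.beta) / self.alpha.clamp(min=1e-6)).clamp(0.0, 1.0)
        # Straight-through estimator: forward pass uses round(),
        # backward pass treats the rounding as the identity.
        z_bin = z + (z.round() - z).detach()
        # Rescale the binary code back to the activation's range.
        return self.alpha * z_bin + self.beta


if __name__ == "__main__":
    act = ElasticBinaryActivation()
    y = act(torch.randn(2, 4))
    y.sum().backward()  # gradients reach alpha and beta via the STE
    print(y, act.alpha.grad, act.beta.grad)
```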