State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which is more costly in hardware design than supporting only 16-bit FPUs. We ask: can we train deep learning models with only 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study 16-bit-FPU training on the widely adopted BFloat16 unit. While these units conventionally use nearest rounding to cast outputs to 16-bit precision, we show that nearest rounding of model weight updates often cancels small updates, which degrades convergence and model accuracy. Motivated by this, we study two simple techniques that are well established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation of 16-bit-FPU training. We demonstrate that these two techniques can enable up to 7% absolute validation accuracy gains in 16-bit-FPU training, yielding validation accuracy ranging from 0.1% below to 0.2% above that of 32-bit training across seven deep learning applications.
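To make the two techniques concrete, below is a minimal NumPy sketch of (a) stochastic rounding to the bfloat16 grid and (b) a Kahan-compensated weight update. It is illustrative only, not the paper's implementation: bfloat16 values are emulated in float32 storage (bfloat16 keeps the top 16 bits of an IEEE float32), and the function names `round_to_bfloat16_stochastic` and `kahan_weight_update` are hypothetical.

```python
import numpy as np


def round_to_bfloat16_stochastic(x: np.ndarray) -> np.ndarray:
    """Stochastically round float32 values to the bfloat16 grid.

    Nearest rounding always snaps to the closer representable neighbor, so a
    small update added to a large weight can vanish entirely. Stochastic
    rounding instead rounds up with probability proportional to the discarded
    low-order bits, which keeps the rounded update unbiased in expectation.
    Assumes finite float32 inputs; the result is a float32 array whose values
    all lie on the bfloat16 grid.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Random offset over the 16 mantissa bits that bfloat16 discards.
    noise = np.random.randint(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    rounded = (bits + noise) & np.uint32(0xFFFF0000)  # truncate low 16 bits
    return rounded.view(np.float32)


def kahan_weight_update(w, compensation, update):
    """One Kahan-compensated weight update in low-precision storage.

    `compensation` accumulates the part of each update that rounding discards
    and re-injects it on the next step, so many small updates are not lost.
    """
    corrected = update + compensation                 # add back lost bits
    new_w = round_to_bfloat16_stochastic(w + corrected)
    compensation = corrected - (new_w - w)            # what rounding discarded
    return new_w, compensation
```

For example, with `w = np.float32([256.0])` and a per-step update of `1e-3`, nearest rounding to bfloat16 leaves `w` unchanged forever, whereas either of the sketched techniques lets the accumulated updates eventually move the weight; in the paper the two remedies are evaluated as separate options, and combining them here is only for brevity of the sketch.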