State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit precision alone is not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit compute units, which is more costly in hardware design than supporting only 16-bit units. We ask: can we do pure 16-bit training, which requires only 16-bit compute units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study pure 16-bit training algorithms on the widely adopted BFloat16 compute unit. While these units conventionally use nearest rounding to cast outputs to 16-bit precision, we show that nearest rounding of model weight updates can often cancel small updates, which degrades convergence and model accuracy. Motivated by this, we identify two simple existing techniques, stochastic rounding and Kahan summation, that remedy the accuracy degradation in pure 16-bit training. We empirically show that these two techniques can enable up to a 7% absolute validation accuracy gain in pure 16-bit training, bringing it to within 0.1% below to 0.2% above the validation accuracy of 32-bit precision training across seven deep learning applications.
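To make the rounding issue concrete, below is a minimal NumPy sketch (not from the paper) that emulates BFloat16 weight updates in software. It illustrates three behaviors the abstract refers to: with round-to-nearest, an update smaller than half a unit in the last place is cancelled at every step; stochastic rounding preserves the update in expectation; and a Kahan-style compensation buffer re-injects the bits each cast throws away. The helper names `to_bf16_nearest` and `to_bf16_stochastic` are illustrative, the intermediate arithmetic runs in float32 rather than on a real 16-bit unit, and keeping the compensation buffer in bfloat16 is our assumption about the pure 16-bit setting.

```python
import numpy as np

def to_bf16_nearest(x):
    """Emulate casting a float32 array to bfloat16 with round-to-nearest-even,
    keeping the result in float32 storage (low 16 mantissa bits zeroed)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = np.uint32(0x7FFF) + ((bits >> 16) & np.uint32(1))
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def to_bf16_stochastic(x, rng):
    """Emulate casting to bfloat16 with stochastic rounding: add uniform noise
    to the bits that bfloat16 discards, then truncate."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    return ((bits + noise) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
update = np.float32(1e-4)   # small SGD step, below half a bfloat16 ulp at 1.0
steps = 1000                # exact float32 result: 1.0 - steps * update = 0.9

# Round-to-nearest: every update is rounded away, so the weight never moves.
w_nearest = np.ones(1, dtype=np.float32)
for _ in range(steps):
    w_nearest = to_bf16_nearest(w_nearest - update)

# Stochastic rounding: each cast rounds down with probability proportional to
# the discarded fraction, so the update survives in expectation.
w_stoch = np.ones(1, dtype=np.float32)
for _ in range(steps):
    w_stoch = to_bf16_stochastic(w_stoch - update, rng)

# Kahan summation: an auxiliary compensation buffer (kept in bfloat16 here,
# an assumption for illustration) carries the bits lost by each cast forward.
w_kahan = np.ones(1, dtype=np.float32)
comp = np.zeros(1, dtype=np.float32)
for _ in range(steps):
    adjusted = -update - comp                             # fold in previously lost low bits
    new_w = to_bf16_nearest(w_kahan + adjusted)           # 16-bit weight update
    comp = to_bf16_nearest((new_w - w_kahan) - adjusted)  # what this cast threw away
    w_kahan = new_w

print(w_nearest[0], w_stoch[0], w_kahan[0])
```

In this sketch, the nearest-rounded weight stays stuck at 1.0, while the stochastic and Kahan variants land close to the exact float32 answer of 0.9 (the stochastic result holds only in expectation, so it varies with the seed).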