Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost robustness. However, LN requires on-the-fly statistics calculation during inference, as well as division and square-root operations, leading to inefficiency on hardware. Moreover, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations while achieving performance on par with LN. UN boosts performance by calibrating the activation and gradient statistics with a tailored fluctuation-smoothing strategy. Meanwhile, an adaptive outlier-filtration strategy is applied to avoid collapse in training; its effectiveness is proved theoretically and verified experimentally in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN through extensive experiments on language and vision tasks. We also evaluate the efficiency of our method on GPU: Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at https://github.com/hikvision-research/Unified-Normalization.
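The claimed inference speedup rests on the fact that, unlike LN, a normalization whose statistics are fixed at inference reduces to a per-channel affine map and can therefore be folded into an adjacent linear operation. The following minimal PyTorch sketch illustrates this kind of fusion under that assumption; the helper `fuse_norm_into_linear` and its interface are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_norm_into_linear(mean, var, gamma, beta, linear, eps=1e-5):
    # Hypothetical helper: fold a normalization with fixed per-channel
    # statistics (mean, var) and affine parameters (gamma, beta) into a
    # following nn.Linear. At inference,
    #   linear(gamma * (x - mean) / sqrt(var + eps) + beta)
    # collapses into a single affine map  W' x + b'.
    scale = gamma / torch.sqrt(var + eps)           # per-channel multiplier
    shift = beta - mean * scale                     # per-channel offset
    fused = nn.Linear(linear.in_features, linear.out_features)
    fused.weight.copy_(linear.weight * scale)       # W' = W * diag(scale)
    bias = linear.bias if linear.bias is not None else torch.zeros(linear.out_features)
    fused.bias.copy_(linear.weight @ shift + bias)  # b' = W @ shift + b
    return fused

# Quick numerical check on random inputs.
d_in, d_out = 8, 4
lin = nn.Linear(d_in, d_out)
mean, var = torch.randn(d_in), torch.rand(d_in) + 0.5
gamma, beta = torch.randn(d_in), torch.randn(d_in)
fused = fuse_norm_into_linear(mean, var, gamma, beta, lin)
x = torch.randn(2, d_in)
ref = lin(gamma * (x - mean) / torch.sqrt(var + 1e-5) + beta)
assert torch.allclose(fused(x), ref, atol=1e-5)
```

The fused layer performs one matrix multiply instead of normalization followed by a matrix multiply, removing the division and square-root operations from the inference path.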