Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiency and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvements. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
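To make the core idea concrete, the sketch below illustrates (under assumed, illustrative shapes and rank that are not the paper's actual search results) how a dense Transformer weight can be reshaped into a higher-order tensor, replaced by two low-rank factor tensors, and evaluated with a single einsum contraction whose path choice governs the actual FLOP cost.

```python
import numpy as np

# Minimal illustrative sketch, NOT the exact HEAT decomposition:
# a dense d_in x d_out weight is tensorized into modes (a, b) x (c, d)
# and replaced by two low-rank factors contracted via einsum.
# All shapes and the rank below are arbitrary assumptions for demonstration;
# in a HEAT-style flow they are search knobs chosen with hardware feedback.

d_in, d_out, batch = 768, 768, 4
a, b = 32, 24          # tensorization shape for the input:  d_in  = a * b
c, d = 24, 32          # tensorization shape for the output: d_out = c * d
rank = 8               # decomposition rank

# Two factor tensors replacing the dense 768*768 = 589,824-parameter weight.
G1 = 0.02 * np.random.randn(a, c, rank)   # couples input mode a with output mode c
G2 = 0.02 * np.random.randn(rank, b, d)   # couples input mode b with output mode d

x = np.random.randn(batch, d_in)

def factorized_linear(x):
    # Expose the tensorization modes of the activations, then contract both
    # factors in one fused einsum; the contraction path determines the FLOPs.
    xt = x.reshape(batch, a, b)
    y = np.einsum('nab,acr,rbd->ncd', xt, G1, G2, optimize=True)
    return y.reshape(batch, d_out)

y = factorized_linear(x)
dense_params = d_in * d_out
factorized_params = G1.size + G2.size
print(y.shape, f"params: {dense_params} -> {factorized_params}")
```

In this toy setting the two factors hold 12,288 parameters instead of 589,824, which shows why the choice of tensorization shape and rank, together with the contraction path, dominates both the compression ratio and the realized hardware efficiency.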