Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve the accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving a practical model accuracy using a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
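The idea of adapting the bitwidth to the available bandwidth can be illustrated with a minimal sketch. This is not QuantPipe's implementation: the function names, the candidate bitwidths, and the per-transfer latency budget are assumptions made for illustration, and the clipping threshold `clip` is simply passed in rather than found by the DS-ACIQ search described above.

```python
def choose_bitwidth(num_elements, bandwidth_bps, budget_s, candidates=(2, 4, 8, 16)):
    """Pick the largest candidate bitwidth whose transfer time fits the budget.

    Hypothetical policy: transfer time is modeled as payload bits / bandwidth.
    Falls back to the smallest candidate if none fits.
    """
    fitting = [b for b in candidates
               if num_elements * b / bandwidth_bps <= budget_s]
    return max(fitting) if fitting else min(candidates)


def quantize(values, bits, clip):
    """Clip values to [-clip, clip] and map them onto 2**bits uniform levels."""
    scale = 2.0 * clip / (2 ** bits - 1)  # spacing between adjacent levels
    codes = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        codes.append(round((clipped + clip) / scale))
    return codes, scale


def dequantize(codes, scale, clip):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale - clip for c in codes]


if __name__ == "__main__":
    activations = [-1.0, -0.5, 0.1, 0.5, 1.0, 3.0]  # 3.0 is an outlier that gets clipped
    bits = choose_bitwidth(num_elements=1_000_000, bandwidth_bps=100e6, budget_s=0.05)
    codes, scale = quantize(activations, bits, clip=1.0)
    print(bits, codes, dequantize(codes, scale, clip=1.0))
```

A smaller `clip` discards outliers but leaves finer levels for the bulk of the values, which is why the choice of clipping threshold matters most at low bitwidths such as 2 bits.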