Deep Learning (DL) models have achieved superior performance across a wide range of tasks. Meanwhile, computing hardware such as NVIDIA GPUs has shown a strong scaling trend, roughly doubling throughput and memory bandwidth with each generation. With such rapid GPU scaling, multi-tenant deep learning inference, which co-locates multiple DL models on the same GPU, is now widely deployed to improve resource utilization, increase serving throughput, and reduce energy cost. However, achieving efficient multi-tenant DL inference is challenging and requires thorough full-stack system optimization. This survey aims to summarize and categorize the emerging challenges and optimization opportunities for multi-tenant DL inference on GPUs. By overviewing the entire optimization stack, summarizing multi-tenant computing innovations, and elaborating on recent technological advances, we hope this survey can shed light on new optimization perspectives and motivate novel work in future large-scale DL system optimization.
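To make the co-location idea concrete, the following is a minimal sketch (not from the survey itself) of serving two hypothetical "tenant" models on one GPU using separate CUDA streams in PyTorch, so their kernels can overlap rather than serialize on the default stream; the model architectures and input shapes are illustrative placeholders.

```python
# Minimal sketch: co-locating two independent models on one GPU via CUDA streams.
# Assumes PyTorch and a CUDA-capable GPU; models and shapes are placeholders.
import torch
import torch.nn as nn

device = torch.device("cuda")

# Two hypothetical "tenant" models sharing the same GPU.
model_a = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device).eval()
model_b = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(),
                        nn.Linear(16 * 32 * 32, 10)).to(device).eval()

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

x_a = torch.randn(64, 512, device=device)
x_b = torch.randn(64, 3, 32, 32, device=device)

with torch.no_grad():
    # Launch each tenant's inference on its own stream so the GPU scheduler
    # may execute the two workloads concurrently.
    with torch.cuda.stream(stream_a):
        out_a = model_a(x_a)
    with torch.cuda.stream(stream_b):
        out_b = model_b(x_b)

torch.cuda.synchronize()  # wait for both tenants before reading results
print(out_a.shape, out_b.shape)
```

Whether such co-location actually improves utilization depends on scheduling, memory contention, and interference between tenants, which are exactly the full-stack concerns this survey categorizes.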