GPU technology has been improving rapidly in both size and performance, empowering HPC and AI/ML researchers to advance scientific discovery. However, this growth also leads to inefficient resource usage, because most GPU workloads, including complex AI/ML models, cannot fully utilize the resources of a single GPU. We propose MISO, a technique that exploits the Multi-Instance GPU (MIG) capability of NVIDIA A100 GPUs to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight and more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of actually applying candidate MIG partitions during exploration. Because it utilizes GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.
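The partition-selection step can be illustrated with a minimal sketch. It is not MISO's actual algorithm: it assumes a hypothetical table `predicted_speed` giving each job's relative speed on each MIG slice size (the kind of estimate the MPS-based exploration is meant to supply), and it assumes the objective is the sum of relative speeds. It also simplifies the hardware rules by accepting any combination of slice sizes that fits in the seven compute slices of an A100, whereas real MIG configurations carry additional placement constraints.

```python
from itertools import product

# Compute-slice sizes offered by MIG on an A100 (1g, 2g, 3g, 4g, 7g).
SLICE_SIZES = (1, 2, 3, 4, 7)
TOTAL_SLICES = 7


def best_partition(predicted_speed):
    """Pick one slice size per job to maximize the summed predicted speed.

    predicted_speed: dict mapping job name -> {slice_size: relative speed}.
    This table is a hypothetical input; the objective is an assumption.
    Simplification: any size combination with sum <= TOTAL_SLICES is allowed.
    """
    jobs = list(predicted_speed)
    best_score, best_assignment = float("-inf"), None
    # Exhaustive search is fine here: only a handful of jobs share one GPU.
    for sizes in product(SLICE_SIZES, repeat=len(jobs)):
        if sum(sizes) > TOTAL_SLICES:
            continue  # does not fit on a single GPU
        score = sum(predicted_speed[job][size] for job, size in zip(jobs, sizes))
        if score > best_score:
            best_score, best_assignment = score, dict(zip(jobs, sizes))
    return best_assignment, best_score


# Illustrative, made-up predictions for two co-located jobs.
example = {
    "job_a": {1: 0.3, 2: 0.5, 3: 0.7, 4: 0.9, 7: 1.0},
    "job_b": {1: 0.6, 2: 0.8, 3: 0.9, 4: 0.95, 7: 1.0},
}
assignment, score = best_partition(example)
```

In this made-up example, `job_a` gains more from extra slices than `job_b`, so the search gives it the larger share. The search itself only reads the prediction table, which reflects the idea that candidate partitions can be compared without reconfiguring MIG for each one.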