GPUs built on new architectures, such as the NVIDIA A100, are equipped with Multi-Instance GPU (MIG) technology, which allows a single GPU to be partitioned into multiple small, isolated instances. This technology gives users more flexibility to support both deep learning training and inference workloads, but utilizing it efficiently remains challenging. The vision of this paper is to provide a comprehensive and practical benchmark study of MIG that eliminates the need for tedious manual benchmarking and tuning. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines benchmark studies of MIG. Using MIGPerf, the authors conduct a series of experiments, including characterization of deep learning training and inference on MIG, characterization of GPU sharing, and evaluation of framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to employ MIG effectively, and lay the foundation for further research on orchestrating hybrid training and inference workloads on MIG instances. The code and results are released at https://github.com/MLSysOps/MIGProfiler. This work is still in progress, and more results will be published soon.