Building large AI fleets to support rapidly growing deep learning (DL) workloads is an active research topic for modern cloud providers. Accurate benchmarks are essential for designing the rapidly evolving software and hardware solutions in this space. Two fundamental challenges in making benchmark generation scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution trace (ET), a new feature that captures the runtime information of AI models at operator granularity, in a graph format, together with operator metadata. By sourcing ETs from the fleet, we can build AI benchmarks that are portable and representative. Mystique is scalable because its data collection is lightweight in both runtime overhead and instrumentation effort. It is also adaptive, because ET composability allows flexible control over benchmark creation. We evaluate our methodology on several production AI models and show that benchmarks generated with Mystique closely resemble the original AI models, both in execution time and in system-level metrics. We also showcase the portability of the generated benchmarks across platforms and demonstrate several use cases enabled by the fine-grained composability of the execution trace.
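To make the notions of an operator-level trace and of composability more concrete, the following is a minimal sketch under stated assumptions. It uses a simplified, hypothetical trace schema: the `OpNode` fields, the sample `TRACE`, and the helper `select_subgraph` are illustrative names, not the actual PyTorch ET format or Mystique's API. It shows only how a subset of a graph-structured trace might be selected as the basis for a smaller benchmark.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OpNode:
    """One operator record in a simplified, hypothetical execution trace."""
    id: int
    name: str
    parent: int  # id of the enclosing node; 0 denotes the trace root
    input_shapes: List[Tuple[int, ...]] = field(default_factory=list)


def build_children(nodes: List[OpNode]) -> Dict[int, List[int]]:
    """Map each parent id to the ids of its direct children."""
    children: Dict[int, List[int]] = {}
    for node in nodes:
        children.setdefault(node.parent, []).append(node.id)
    return children


def select_subgraph(nodes: List[OpNode], root_id: int) -> List[OpNode]:
    """Return the node root_id and all its descendants, in trace order."""
    children = build_children(nodes)
    keep = set()
    stack = [root_id]
    while stack:
        current = stack.pop()
        keep.add(current)
        stack.extend(children.get(current, []))
    return [node for node in nodes if node.id in keep]


# A tiny hand-written trace (assumed data, for illustration only).
TRACE = [
    OpNode(1, "forward", 0),
    OpNode(2, "aten::linear", 1, [(32, 128), (64, 128)]),
    OpNode(3, "aten::addmm", 2, [(64,), (32, 128), (128, 64)]),
    OpNode(4, "aten::relu", 1, [(32, 64)]),
    OpNode(5, "backward", 0),
    OpNode(6, "aten::mm", 5, [(32, 64), (64, 128)]),
]

# Compose a benchmark candidate from the forward pass only.
forward_only = select_subgraph(TRACE, root_id=1)
forward_names = [node.name for node in forward_only]
```

In this sketch, selecting the subtree rooted at the `forward` node keeps its operators and drops the backward pass. This is one simple way to picture what fine-grained composability could allow; a real replay would additionally need tensor metadata and operator arguments to re-execute each operator.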