A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Multi-tenant machine learning services have emerged as data-intensive data center workloads that make heavy use of GPU resources. Because of their large scale, numerous tuning parameters, and heavy resource usage, it is usually impractical to evaluate and benchmark these services on real clusters. In this demonstration, we present AnalySIM, a cluster simulator that enables efficient design exploration for multi-tenant machine learning services. Specifically, through trace-driven simulation of cluster workloads, AnalySIM makes it easy to test and analyze various scheduling policies against a number of performance metrics, such as GPU resource utilization. AnalySIM models the cluster's computational resources in terms of both physical topology and logical partitioning. The tool has been used at SenseTime to understand the impact of different scheduling policies, using a trace from a real production cluster of over 1000 GPUs. We find that preemption and migration can significantly reduce average job completion time and mitigate the resource fragmentation problem.
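To make the idea of trace-driven scheduling simulation concrete, the following is a minimal sketch, not AnalySIM's implementation. It assumes a hypothetical trace format (submission time, GPUs requested, run time), a flat pool of interchangeable GPUs with no topology or partitioning, and a strict FIFO policy without preemption or migration. The names `Job` and `simulate_fifo` are illustrative only. It reports two of the metrics mentioned above: average job completion time and GPU utilization.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    submit: float    # submission time in seconds (hypothetical trace field)
    gpus: int        # number of GPUs requested
    duration: float  # run time in seconds once the job starts


def simulate_fifo(trace, total_gpus):
    """Replay a trace under strict FIFO, non-preemptive scheduling.

    Returns (average job completion time, GPU utilization).
    Assumes a flat GPU pool; topology and partitions are ignored.
    """
    jobs = sorted(trace, key=lambda j: j.submit)
    if not jobs:
        return 0.0, 0.0

    running = []            # min-heap of (finish_time, gpus_held)
    free = total_gpus
    now = jobs[0].submit
    first_submit = jobs[0].submit
    last_finish = first_submit
    busy_gpu_seconds = 0.0
    completion_times = []

    for job in jobs:
        if job.gpus > total_gpus:
            raise ValueError(f"job {job.job_id} can never fit in the cluster")
        now = max(now, job.submit)
        # Release GPUs held by jobs that have already finished.
        while running and running[0][0] <= now:
            _, held = heapq.heappop(running)
            free += held
        # Head-of-line blocking: advance time until this job fits.
        while free < job.gpus:
            finish, held = heapq.heappop(running)
            now = max(now, finish)
            free += held
        free -= job.gpus
        finish = now + job.duration
        heapq.heappush(running, (finish, job.gpus))
        completion_times.append(finish - job.submit)
        busy_gpu_seconds += job.gpus * job.duration
        last_finish = max(last_finish, finish)

    makespan = last_finish - first_submit
    avg_jct = sum(completion_times) / len(completion_times)
    utilization = busy_gpu_seconds / (total_gpus * makespan) if makespan > 0 else 0.0
    return avg_jct, utilization


# A tiny made-up trace on a 4-GPU cluster (illustrative data, not from the paper).
example_trace = [
    Job("a", submit=0.0, gpus=4, duration=10.0),
    Job("b", submit=1.0, gpus=2, duration=5.0),
    Job("c", submit=2.0, gpus=2, duration=5.0),
]
avg_jct, utilization = simulate_fifo(example_trace, total_gpus=4)
print(f"average JCT: {avg_jct:.2f} s, GPU utilization: {utilization:.2%}")
```

Comparing policies in this style amounts to swapping the admission logic inside the loop and replaying the same trace, which is the general pattern a trace-driven simulator relies on. Modeling preemption, migration, or fragmentation would need per-node GPU accounting, which this sketch deliberately leaves out.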
