Artificial intelligence (AI) and machine learning (ML) workloads account for an increasingly large share of the compute workloads in traditional high-performance computing (HPC) centers and commercial cloud systems. This has led to changes in how HPC clusters and the commercial cloud are deployed, as well as a new focus on optimized resource usage, allocation and deployment of new AI frameworks, and capabilities such as Jupyter notebooks that enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster and datacenter operations, with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization and energy/power consumption, predicting failures, and identifying policy violations. In this paper, we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the analysis of large-scale HPC and datacenter/cloud operations. We provide detailed monitoring logs from the MIT Supercloud system, which include CPU and GPU usage by jobs, memory usage, file system logs, and physical monitoring data. This paper describes the dataset, its collection methodology, and data availability, and discusses potential challenge problems being developed using this data. Datasets and future challenge announcements will be available via https://dcc.mit.edu.