Runtime Variation in Big Data Analytics

April 7, 2023

Yiwen Zhu, Rathijit Sen, Robert Horton, John Mark Agosta

From arXiv; SIGMOD 2023

The dynamic nature of resource allocation and runtime conditions in the cloud can cause a job's runtime to vary widely across repeated runs, leading to a poor user experience. Identifying the sources of such variation, and being able to predict and adjust for them, is crucial for cloud service providers that need to design reliable data processing pipelines, provision and allocate resources, adjust pricing, meet SLOs, and debug performance hazards. In this paper, we analyze the runtime variation of millions of production SCOPE jobs on Cosmos, an exabyte-scale internal analytics platform at Microsoft. We propose an innovative 2-step approach that predicts a job's runtime distribution by characterizing typical distribution shapes and combining them with a classification model. The approach reaches an average accuracy of >96%, outperforms traditional regression models, and better captures long tails. We examine factors such as job plan characteristics and inputs, resource allocation, physical cluster heterogeneity and utilization, and scheduling policies. To the best of our knowledge, this is the first study on predicting categories of runtime distributions for enterprise analytics workloads at scale. Furthermore, we examine how our methods can be used to analyze what-if scenarios, focusing on the impact of resource allocation, scheduling, and physical cluster provisioning decisions on a job's runtime consistency and predictability.
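The 2-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering method, the classifier, or the features, so plain k-means and a nearest-centroid classifier stand in for them here. The bin edges, the single cluster-utilization feature, and all function names are assumptions made for illustration, and the data is synthetic.

```python
import random
from statistics import median

# Bin edges for runtime / median-runtime ratios (assumed for this sketch).
EDGES = [0.0, 0.8, 1.2, 2.0, float("inf")]

def shape_histogram(runtimes, edges=EDGES):
    """Normalize runtimes by their median and return per-bin probabilities."""
    med = median(runtimes)
    counts = [0] * (len(edges) - 1)
    for r in runtimes:
        ratio = r / med
        for i in range(len(counts)):
            if edges[i] <= ratio < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(runtimes) for c in counts]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cluster_shapes(histograms, k, iters=10):
    """Step 1 (stand-in): k-means over histograms to find typical shapes."""
    # Deterministic farthest-point initialization.
    centroids = [histograms[0]]
    while len(centroids) < k:
        centroids.append(
            max(histograms, key=lambda h: min(sq_dist(h, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [nearest(h, centroids) for h in histograms]
        for j in range(k):
            members = [h for h, lab in zip(histograms, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    labels = [nearest(h, centroids) for h in histograms]
    return centroids, labels

def fit_classifier(features, labels):
    """Step 2 (stand-in): nearest-centroid classifier, features -> shape label."""
    by_label = {}
    for f, lab in zip(features, labels):
        by_label.setdefault(lab, []).append(f)
    return {lab: mean_vector(fs) for lab, fs in by_label.items()}

def predict_shape(model, feature):
    return min(model, key=lambda lab: sq_dist(feature, model[lab]))

def demo(seed=0):
    """Synthetic job groups: even ones are stable, odd ones are long-tailed."""
    rng = random.Random(seed)
    histograms, features = [], []
    for g in range(40):
        if g % 2 == 0:
            runs = [rng.gauss(100.0, 2.0) for _ in range(200)]
            feat = [rng.uniform(0.0, 0.3)]  # hypothetical utilization feature
        else:
            runs = [100.0 * (1.0 + rng.expovariate(1.0)) for _ in range(200)]
            feat = [rng.uniform(0.6, 1.0)]
        histograms.append(shape_histogram(runs))
        features.append(feat)
    centroids, labels = cluster_shapes(histograms, k=2)
    model = fit_classifier(features, labels)
    return centroids, labels, model

if __name__ == "__main__":
    centroids, labels, model = demo()
    for lab, hist in enumerate(centroids):
        print(lab, [round(p, 2) for p in hist])
    print("predicted shape for utilization 0.9:", predict_shape(model, [0.9]))
```

In this sketch the prediction is a shape category rather than a single runtime number, so the bin for ratios of 2.0 and above keeps the long tail visible. A point estimate from a regression model would not carry that information.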
