Many techniques have been proposed for detecting software misconfigurations in cloud systems and for diagnosing unintended behavior caused by such misconfigurations. Detection and diagnosis are steps in the right direction: misconfigurations cause many costly failures and severe performance issues. But we argue that continued focus on detection and diagnosis is symptomatic of a more serious problem: configuration design and implementation are not yet first-class software engineering endeavors in cloud systems. Little is known about how and why developers evolve configuration design and implementation, or about the challenges that they face in doing so. This paper presents a source-code-level study of the evolution of configuration design and implementation in cloud systems. Our goal is to understand the rationale and developer practices for revising initial configuration design and implementation decisions, especially in response to the consequences of misconfigurations. To this end, we studied 1178 configuration-related commits from a 2.5-year version-control history of four large-scale, actively maintained open-source cloud systems (HDFS, HBase, Spark, and Cassandra). We derive new insights into the software configuration engineering process. Our results motivate new techniques for proactively reducing misconfigurations by improving the configuration design and implementation process in cloud systems. We highlight a number of future research directions.