There is abundant observational data in the software engineering domain, whereas running large-scale controlled experiments is often practically impossible. Thus, most empirical studies can only report statistical correlations -- instead of potentially more insightful and robust causal relations. This paper discusses some novel techniques that support analyzing purely observational data for causal relations. Using fundamental causal models such as directed acyclic graphs, one can rigorously express, and partially validate, causal hypotheses, and then use the causal information to guide the construction of a statistical model that captures genuine causal relations -- such that correlation does imply causation. We apply these ideas to analyzing public data about programmer performance in Code Jam, a large worldwide coding contest organized by Google every year. Specifically, we look at the impact of different programming languages on a participant's performance in the contest. While the overall effect associated with programming languages is weak compared to other variables -- regardless of whether we consider correlational or causal links -- we found considerable differences between a purely statistical and a causal analysis of the very same data. The takeaway message is that even an imperfect causal analysis of observational data can help answer the salient research questions more precisely and more robustly than purely statistical techniques alone.
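To make the idea of letting a causal graph guide the statistical model concrete, the following is a minimal sketch on synthetic data, not the paper's model or the Code Jam dataset. It assumes a hypothetical three-node DAG in which `experience` influences both `language` choice and `performance`, so that `experience` confounds the language effect. All variable names, coefficients, and the true effect of 0.1 are illustrative assumptions.

```python
import numpy as np

# Hypothetical DAG (an assumption for illustration only):
#   experience -> language, experience -> performance, language -> performance

def simulate(n=20000, true_effect=0.1, seed=0):
    """Draw synthetic data in which experience confounds the language effect."""
    rng = np.random.default_rng(seed)
    experience = rng.normal(size=n)  # confounder
    # More experienced participants are more likely to pick language 1.
    p_language = 1.0 / (1.0 + np.exp(-1.5 * experience))
    language = (rng.random(n) < p_language).astype(float)
    performance = true_effect * language + experience + rng.normal(size=n)
    return experience, language, performance

def ols_coefficients(y, *predictors):
    """Least-squares fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

experience, language, performance = simulate()

# Purely correlational model: the language coefficient absorbs the
# effect of experience through the open backdoor path.
naive = ols_coefficients(performance, language)[1]

# Model suggested by the DAG: adjusting for experience closes the
# backdoor path, so the language coefficient estimates the causal effect.
adjusted = ols_coefficients(performance, language, experience)[1]

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

In this sketch the unadjusted coefficient is far larger than the simulated effect, while the adjusted one lands close to it. The adjustment is only valid under the assumed graph; a different DAG (for example, one where the extra variable is a collider) would call for a different adjustment set.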