Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality

Statistical analysis is the tool of choice to turn data into information, and then information into empirical knowledge. To be valid, the process that goes from data to knowledge should be supported by detailed, rigorous guidelines, which help ferret out issues with the data or model, and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is suited to empirical research in software engineering. To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset's original analysis (Ray et al., 2014) and a critical reanalysis (Berger et al., 2019) have attracted considerable attention -- in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase some of its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality. The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state of the art while highlighting the boundaries of its validity. The guidelines can support building solid statistical analyses and connecting their results, and hence help buttress continued progress in empirical software engineering research.

