With the advent of open source software, a veritable treasure trove of previously proprietary software development data became available. This opened the field of empirical software engineering research to anyone in academia. Data mined from software projects, however, requires extensive processing and must be handled with utmost care to ensure valid conclusions. Since software development practices and tools have changed over the past two decades, we aim to understand state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review, sampling over one thousand papers from leading conferences and analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow, dataset selection, was particularly problematic, which raises questions about the generality of the results in the existing literature. Furthermore, we found that a considerable number of papers provide little or no reproducibility instructions -- a substantial deficiency for a data-intensive field. In fact, 33% of papers provide no information on how their data was retrieved. Based on these findings, we propose ways to address these shortcomings via existing tools and provide recommendations to improve research workflows and the reproducibility of research.