Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, no study has quantified how frequently the software engineering research community uses such data, investigated the sources of dirty time-based data, or quantified how often such data is dirty. Depending on the research task and method used, including such dirty data could affect the research results. This paper presents the first survey of papers published in the Mining Software Repositories (MSR) conference series that utilize time-based data. Of the 690 technical track and data papers published in MSR 2004--2020, we found that at least 35% utilized time-based data. We then used the Boa and Software Heritage infrastructures to help identify and quantify several sources of dirty commit timestamp data. Finally, we provide guidelines and best practices for researchers utilizing time-based data from Git repositories.