The volume of data generated by customers and retained by companies has grown explosively over the last decade. However, automation methods and tools have not kept pace with this growth. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, based on geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and from the practical advice drawn from our experience.