Automated data validation: an industrial experience report

Over the last decade, the volume of data generated by customers and retained by companies has grown explosively. However, the automation methods and tools available for handling this data have not kept pace with its volume. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we would like to bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, conducted on geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and from the practical advice based on our experience.
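To make the idea of automated validation during data preparation concrete, here is a minimal sketch in Python. It is not RESTORE's API, and it is not the method described in the paper: the function `validate_rows`, the column names `area_id` and `population`, and the three checks (missing values, unexpected types, duplicate keys) are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def validate_rows(
    rows: List[Dict[str, Any]],
    required: Dict[str, type],
    key: str,
) -> List[str]:
    """Return a description of every problem found in `rows`.

    Hypothetical helper for illustration only; not part of RESTORE.
    """
    problems: List[str] = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Check each required column for missing values and wrong types.
        for column, expected_type in required.items():
            value = row.get(column)
            if value is None:
                problems.append(f"row {i}: missing value for '{column}'")
            elif not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: '{column}' has type {type(value).__name__}, "
                    f"expected {expected_type.__name__}"
                )
        # Check that the key column uniquely identifies each row.
        row_key = row.get(key)
        if row_key is not None:
            if row_key in seen_keys:
                problems.append(f"row {i}: duplicate key {row_key!r}")
            seen_keys.add(row_key)
    return problems


# Made-up sample records, loosely shaped like area-level data.
sample = [
    {"area_id": "A1", "population": 1200},
    {"area_id": "A2", "population": None},
    {"area_id": "A1", "population": "n/a"},
]
for problem in validate_rows(sample, {"area_id": str, "population": int}, key="area_id"):
    print(problem)
```

Running checks like these automatically after every data preparation step, rather than inspecting the output by hand, is the kind of practice the abstract argues for. A real harness would likely cover many more conditions than this sketch.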

