In this perspective, we argue that despite the democratization of powerful tools for data science and machine learning over the last decade, developing the code for a trustworthy and effective data science system (DSS) is getting harder. Among the many root causes we identify are perverse incentives and a lack of widespread software engineering (SE) skills, which naturally give rise to the current systemic crisis in the reproducibility of DSSs. We analyze why SE and building large complex systems are, in general, hard. Based on these insights, we identify how SE addresses those difficulties and how SE methods can be applied and generalized to construct DSSs that are fit for purpose. We advocate two key development philosophies: DSSs should be grown incrementally rather than planned and built in two separate phases, and development should always employ two types of feedback loops, one that tests the code's correctness and another that evaluates the code's efficacy.