Data scientists often use notebooks to develop data science (DS) pipelines, particularly because notebooks allow parts of a pipeline to be executed selectively. However, notebooks for DS have many well-known flaws. In this paper, we focus on the following ones: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g., listing the columns of a tabular dataset). (2) Although users may execute cells in any order, not every order is correct, because a cell can depend on declarations from other cells. (3) After a cell is changed, that cell and all cells that depend on the changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by a change can inadvertently be re-executed. To solve these issues, we propose replacing cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions that fit their type (such as listing the columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis, which ensures that dependencies between variables are respected and that results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of errors.
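To make the idea of dependency-aware re-execution concrete, the following is a minimal sketch of a statement-level data-flow analysis. It is not the implementation proposed in this paper. The helper names `reads_and_writes` and `statements_to_rerun`, as well as the pipeline functions `load`, `drop_missing`, `fit`, and `summarize`, are hypothetical and chosen only for illustration. The sketch assumes a straight-line pipeline of top-level statements and tracks plain variable names only, so it ignores in-place mutation through method calls, imports, and control flow.

```python
import ast


def reads_and_writes(stmt):
    """Return (names read, names written) by one top-level statement."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes


def statements_to_rerun(source, changed_index):
    """Indices of top-level statements to re-execute after one was edited.

    Conservative rule (an assumption of this sketch): a later statement is
    rerun if it reads or writes any variable that may hold an outdated value.
    """
    stmts = ast.parse(source).body
    effects = [reads_and_writes(s) for s in stmts]
    stale = set(effects[changed_index][1])  # variables written by the edit
    rerun = [changed_index]
    for i in range(changed_index + 1, len(stmts)):
        reads, writes = effects[i]
        if (reads | writes) & stale:
            rerun.append(i)
            stale |= writes  # its outputs may now differ as well
    return rerun


# The pipeline is only parsed, never executed, so the functions it
# calls do not need to exist.
pipeline = """\
raw = load("data.csv")
clean = drop_missing(raw)
labels = load("labels.csv")
model = fit(clean, labels)
report = summarize(labels)
"""

print(statements_to_rerun(pipeline, 1))  # [1, 3]
```

In this example, editing the second statement (index 1) requires rerunning only that statement and the model fit. The statements that load the labels and summarize them are left alone, which a cell-based notebook cannot guarantee if they happen to share a cell with affected code.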