Jupyter notebooks are a distinctive programming format: a combination of code and Markdown with rich formatting, divided into individual cells. We propose viewing a Jupyter notebook cell as a simplified, raw version of a programming function. Like functions, Jupyter cells should strive to contain single, self-contained actions. Research shows, however, that real-world notebooks fail to do so and suffer from a lack of proper structure. To combat this, we propose ReSplit, an algorithm for automatically re-splitting cells in Jupyter notebooks. The algorithm analyzes definition-usage chains in the notebook and consists of two parts: merging cells and splitting cells. We ran the algorithm on a large corpus of notebooks to evaluate its performance and its overall effect on notebooks, and we had human experts evaluate it by showing them several notebooks in both their original and re-split forms. In 29.5% of cases, the experts selected the re-split notebook as the preferred way of perceiving the code. We analyze what influenced this decision and describe several individual cases in detail.
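To make the notion of definition-usage chains between cells concrete, the following is a minimal sketch. It is not the ReSplit algorithm itself: the helper names `names_in_cell` and `def_use_links` are hypothetical, cells are assumed to be plain Python source strings, and scoping is ignored, so the result is only a coarse approximation of which earlier cell each cell depends on.

```python
import ast


def names_in_cell(source):
    """Return (defined, used) sets of names for one cell's source code."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds the top-level name "a"
                defined.add((alias.asname or alias.name).split(".")[0])
    return defined, used


def def_use_links(cells):
    """Map each cell index to the earlier cells whose definitions it uses.

    Assumption: a used name is attributed to the most recent earlier cell
    that defined it; function and class scopes are not modelled.
    """
    last_def = {}  # name -> index of the latest cell that defined it
    links = {}
    for i, source in enumerate(cells):
        defined, used = names_in_cell(source)
        links[i] = {last_def[name] for name in used if name in last_def}
        for name in defined:
            last_def[name] = i
    return links


# Hypothetical four-cell notebook; the code is only parsed, never executed.
cells = [
    "import pandas as pd",
    "df = pd.DataFrame({'a': [1, 2]})",
    "total = df['a'].sum()",
    "print(total)",
]
links = def_use_links(cells)
```

In this toy notebook each cell depends only on the cell directly before it. Under a cell-as-function view, links of this kind are the sort of signal that could suggest which adjacent cells belong together and where a cell combines unrelated actions.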