Jupyter notebooks enable developers to interleave code snippets with rich text and inline visualizations. Data scientists use Jupyter notebooks as the de facto standard for creating and sharing machine-learning-based solutions, primarily written in Python. Recent studies have demonstrated, however, that a large portion of the Jupyter notebooks available on public platforms are undocumented and lack a narrative structure, which reduces their readability. To address this shortcoming, this paper presents HeaderGen, a novel tool-based approach that automatically annotates code cells with categorical markdown headers based on a taxonomy of machine-learning operations, and classifies and displays function calls according to this taxonomy. To realize this functionality, HeaderGen enhances the existing call graph analysis in PyCG. To improve precision, HeaderGen extends PyCG's analysis with support for handling external library code and with flow-sensitivity. The former is realized by facilitating the resolution of function return types. Furthermore, HeaderGen uses type information to perform pattern matching on code syntax in order to annotate code cells. The evaluation on 15 real-world Jupyter notebooks from Kaggle shows that HeaderGen's underlying call graph analysis yields high accuracy (96.4% precision and 95.9% recall). This is because HeaderGen can resolve return types of external libraries where existing type inference tools such as pytype (by Google), pyright (by Microsoft), and Jedi fall short. The header generation has a precision of 82.2% and a recall of 96.8% with regard to headers created manually by experts. In a user study, HeaderGen helped participants finish comprehension and navigation tasks faster. All participants clearly perceived HeaderGen as useful to their task.
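To make the idea of annotating a code cell from its function calls more concrete, the following is a minimal sketch and not HeaderGen's actual implementation. It assumes a tiny hypothetical taxonomy (`TAXONOMY`) and a hand-written `aliases` map that stands in for the type and return-type information a call graph analysis would provide; the helper names `dotted_name` and `headers_for_cell` are likewise invented for illustration.

```python
import ast

# Hypothetical, heavily simplified taxonomy (an assumption for illustration):
# fully qualified callable -> machine-learning operation category.
TAXONOMY = {
    "pandas.read_csv": "Data Loading",
    "pandas.DataFrame.dropna": "Data Cleaning",
    "sklearn.model_selection.train_test_split": "Data Preparation",
    "sklearn.linear_model.LogisticRegression.fit": "Model Training",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def headers_for_cell(source, aliases):
    """Collect taxonomy categories for the calls found in one code cell.

    `aliases` maps a local name to a fully qualified module or type name.
    It is supplied by hand here; in a real tool this information would
    come from import resolution and return-type analysis.
    """
    headers = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        # Replace the leading local name with its fully qualified form.
        head, _, rest = name.partition(".")
        resolved = aliases.get(head, head) + ("." + rest if rest else "")
        category = TAXONOMY.get(resolved)
        if category and category not in headers:
            headers.append(category)
    return headers


if __name__ == "__main__":
    cell = "df = pd.read_csv('train.csv')\ndf = df.dropna()"
    print(headers_for_cell(cell, {"pd": "pandas", "df": "pandas.DataFrame"}))
```

The sketch shows why type information matters: the call `df.dropna()` can only be matched against the taxonomy once `df` is known to be a `pandas.DataFrame`, which here depends entirely on the manually supplied alias map.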