Using computational notebooks (e.g., Jupyter Notebook), data scientists rationalize their exploratory data analysis (EDA) based on their prior experience and external knowledge such as online examples. For novices or data scientists who lack specific knowledge about the dataset or problem under investigation, effectively obtaining and understanding external information is critical to carrying out EDA. This paper presents EDAssistant, a JupyterLab extension that supports EDA with in-situ search of example notebooks and recommendation of useful APIs, powered by a novel interactive visualization of search results. The code search and recommendation are enabled by state-of-the-art machine learning models trained on a large corpus of EDA notebooks collected online. A user study was conducted to investigate both EDAssistant and data scientists' current practice (i.e., using external search engines). The results demonstrate the effectiveness and usefulness of EDAssistant, and participants appreciated its smooth and in-context support of EDA. We also report several design implications regarding code recommendation tools.