Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for them manually is tedious, and so far there are no tools for searching computational notebooks effectively and efficiently. In this paper, we propose similarity search over computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) of computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each type of content independently, while graph-based similarity captures the relationships between contents. Our framework effectively prunes candidate notebooks that cannot be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular with graph-based similarity, achieves high accuracy and high efficiency.
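To make the set-based variant of the top-k problem concrete, the following is a minimal sketch and not the framework described above. It assumes that each notebook has already been reduced to four sets (code tokens, table column names, libraries, output formats), that per-content similarity is Jaccard similarity, and that the four scores are averaged with equal weight. The abstract fixes none of these choices, and the names `Notebook`, `set_similarity`, and `top_k` are hypothetical. The sketch scans every candidate, so it omits the pruning, caching, and indexing mentioned above.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Notebook:
    """Hypothetical notebook summary: one set per content type."""
    name: str
    code_tokens: frozenset = frozenset()
    table_columns: frozenset = frozenset()
    libraries: frozenset = frozenset()
    output_formats: frozenset = frozenset()


# Content types compared independently (assumed, equal weights).
FIELDS = ("code_tokens", "table_columns", "libraries", "output_formats")


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def set_similarity(query, candidate):
    """Average the per-content Jaccard scores, treating each content type independently."""
    scores = [jaccard(getattr(query, f), getattr(candidate, f)) for f in FIELDS]
    return sum(scores) / len(FIELDS)


def top_k(query, corpus, k):
    """Return the k highest-scoring (score, name) pairs by exhaustive scan."""
    scored = ((set_similarity(query, nb), nb.name) for nb in corpus)
    return heapq.nlargest(k, scored)
```

A query is itself a `Notebook` holding only the contents the user cares about; for example, `top_k(Notebook("q", libraries=frozenset({"pandas"})), corpus, 5)` ranks a list of notebooks by library overlap alone. A graph-based measure would additionally need the relationships between contents, which this sketch does not model.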