Many data scientists use Jupyter notebooks to experiment with code, visualize results, and document rationales or interpretations. The code documentation generation (CDG) task in notebooks is related to, but different from, the code summarization task in software engineering, because one piece of documentation (a markdown cell) may contain text (an informative summary or an indicative rationale) that covers multiple code cells. Our work aims to solve the CDG task by encoding the multiple code cells as separate AST graph structures, for which we propose a hierarchical attention-based ConvGNN component to augment the Seq2Seq network. We build a dataset from publicly available Kaggle notebooks and evaluate our model (HAConvGNN) against baseline models (e.g., Code2Seq or Graph2Seq).
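To illustrate what encoding multiple code cells as separate AST graph structures can look like, the following is a minimal sketch that uses Python's standard `ast` module. It is not the authors' implementation: the helper names `cell_to_graph` and `cells_to_graphs`, the node labeling by AST class name, and the example cells are all assumptions made for illustration, and the hierarchical attention and ConvGNN components are not shown.

```python
import ast
from collections import deque


def cell_to_graph(source):
    """Parse one code cell into (labels, edges) describing its AST.

    labels[i] is the AST class name of node i; edges are (parent, child)
    index pairs. Hypothetical helper, not taken from the paper.
    """
    tree = ast.parse(source)
    labels = [type(tree).__name__]  # node 0 is the root (Module)
    edges = []
    queue = deque([(tree, 0)])
    while queue:
        node, idx = queue.popleft()
        for child in ast.iter_child_nodes(node):
            child_idx = len(labels)
            labels.append(type(child).__name__)
            edges.append((idx, child_idx))
            queue.append((child, child_idx))
    return labels, edges


def cells_to_graphs(cells):
    """Keep one graph per code cell instead of merging them into one."""
    return [cell_to_graph(cell) for cell in cells]


# Two illustrative code cells that a single markdown cell might document.
cells = [
    "import pandas as pd",
    "df = pd.read_csv('train.csv')\ndf.head()",
]
graphs = cells_to_graphs(cells)
for labels, edges in graphs:
    print(len(labels), "nodes,", len(edges), "edges")
```

Keeping one graph per cell preserves the cell boundaries, so a downstream model can attend first within each cell and then across cells. Real notebook cells may contain IPython magics or shell escapes that `ast.parse` rejects, so a practical pipeline would need to filter or strip those lines first.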