Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture the implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset enables models to achieve better performance on the explanation generation task than unrefined data 15x larger, and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy will boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.