Source code summaries are important for program comprehension and maintenance. However, many programs have missing, outdated, or mismatched summaries. Recently, deep learning techniques have been applied to automatically generate summaries for given code snippets. To gain a deeper understanding of how far we are from solving this problem, in this paper we conduct a systematic and in-depth analysis of five state-of-the-art neural source code summarization models on three widely used datasets. Our evaluation results suggest that: (1) The BLEU metric, which existing work widely uses to evaluate the performance of summarization models, has many variants. Ignoring the differences among these variants can affect the validity of the claimed results. Furthermore, we discover an important, previously unknown bug in the BLEU calculation of a commonly used software package. (2) Code pre-processing choices can have a large impact on summarization performance and therefore should not be overlooked. (3) Important characteristics of datasets (corpus size, data splitting method, and duplication ratio) have a significant impact on model evaluation. Based on the experimental results, we provide actionable guidelines for evaluating code summarization more systematically and for choosing the best method in different scenarios. We also suggest possible future research directions. We believe our results can be of great help to practitioners and researchers in this area.
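To make finding (1) concrete, the following is a minimal sketch (ours, not the paper's evaluation code) using NLTK, whose `sentence_bleu`, `corpus_bleu`, and `SmoothingFunction` form one common family of BLEU implementations; the example tokens are invented for illustration. It shows how the same hypothesis/reference pair can receive very different scores depending on which BLEU variant is computed:

```python
# Illustrative only: the same summary pair scored under several BLEU
# variants, showing why the variant must be reported explicitly.
from nltk.translate.bleu_score import (
    sentence_bleu,
    corpus_bleu,
    SmoothingFunction,
)

# A generated summary and its reference, pre-tokenized (invented example).
reference = ["returns", "the", "maximum", "value", "in", "the", "list"]
hypothesis = ["return", "the", "max", "value", "of", "the", "list"]

smooth = SmoothingFunction()

# Unsmoothed sentence-level BLEU: any zero higher-order n-gram precision
# (here, no shared trigrams) collapses the score to effectively zero.
no_smooth = sentence_bleu([reference], hypothesis)

# The same pair under two common smoothing variants.
m1 = sentence_bleu([reference], hypothesis, smoothing_function=smooth.method1)
m4 = sentence_bleu([reference], hypothesis, smoothing_function=smooth.method4)

# Corpus-level BLEU aggregates n-gram counts over the whole corpus before
# taking the geometric mean, which differs from averaging per-sentence scores.
corpus = corpus_bleu([[reference]], [hypothesis])

for name, score in [("unsmoothed", no_smooth), ("method1", m1),
                    ("method4", m4), ("corpus", corpus)]:
    print(f"BLEU ({name}): {score:.4f}")
```

Because the unsmoothed sentence-level score collapses to effectively zero whenever some higher-order n-gram precision is zero, while each smoothing method and corpus-level aggregation handles that case differently, a result reported simply as "BLEU" without naming the variant is hard to compare across papers.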