Source code summaries are important for program comprehension and maintenance. However, many programs have missing, outdated, or mismatched summaries. Recently, deep learning techniques have been exploited to automatically generate summaries for given code snippets. To gain a deeper understanding of how far we are from solving this problem and to provide suggestions for future research, in this paper we conduct a systematic and in-depth analysis of 5 state-of-the-art neural code summarization models on 6 widely used BLEU variants, 4 pre-processing operations and their combinations, and 3 widely used datasets. The evaluation results show that some important factors have a great influence on model evaluation, especially on model performance and the ranking among models. However, these factors can be easily overlooked. Specifically, (1) the BLEU metric widely used in existing work on evaluating code summarization models has many variants. Ignoring the differences among these variants can greatly affect the validity of the claimed results. Furthermore, we conduct human evaluations and find that the metric BLEU-DC correlates best with human perception; (2) code pre-processing choices can have a large (from -18\% to +25\%) impact on summarization performance and should not be neglected. We also explore the aggregation of pre-processing combinations, which further boosts model performance; (3) some important characteristics of datasets (corpus sizes, data splitting methods, and duplication ratios) have a significant impact on model evaluation. Based on the experimental results, we give actionable suggestions for evaluating code summarization and choosing the best method in different scenarios. We also build a shared code summarization toolbox to facilitate future research.
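To illustrate why the choice of BLEU variant matters, the sketch below computes sentence-level BLEU with three smoothing strategies for zero n-gram matches. The variant names and smoothing constants here are illustrative assumptions for exposition, not the exact BLEU-DC or other variants analyzed in the paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(ref, hyp, n):
    """Clipped n-gram matches and total hypothesis n-grams."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped, max(sum(hyp_counts.values()), 1)

def bleu(ref, hyp, max_n=4, smoothing="none"):
    """Sentence-level BLEU; `smoothing` controls zero-count handling.

    "none":    any zero n-gram precision makes the score 0 (unsmoothed)
    "add1":    add-one smoothing on zero counts (illustrative)
    "epsilon": replace a zero count with a small constant (illustrative)
    """
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        clipped, total = clipped_counts(ref, hyp, n)
        if clipped == 0:
            if smoothing == "add1":
                clipped, total = 1, total + 1
            elif smoothing == "epsilon":
                log_precisions.append(math.log(0.1 / total))
                continue
            else:
                return 0.0  # unsmoothed BLEU collapses to zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(sum(log_precisions) / max_n)

if __name__ == "__main__":
    ref = "return the sum of two integers".split()
    hyp = "return sum of integers".split()
    for s in ("none", "add1", "epsilon"):
        print(s, round(bleu(ref, hyp, smoothing=s), 4))
```

On the short summary above, the three variants give visibly different scores for the same hypothesis (the unsmoothed one is exactly 0 because no 3-gram matches), which is why reporting the precise BLEU variant is essential when comparing code summarization models.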