Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts.
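To make the quantitative setup concrete, the sketch below illustrates how a single predicted summary could be scored against a ground truth caption with smoothed BLEU-4, METEOR, and ROUGE-L. It is not the authors' evaluation harness: the smoothing method, whitespace tokenization, and stemming settings are illustrative assumptions, and the example strings are hypothetical. It assumes the nltk and rouge-score packages (with NLTK's wordnet data installed for METEOR).

```python
# Minimal scoring sketch, assuming nltk >= 3.6.6 and the rouge-score package.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "returns the index of the first matching element"  # hypothetical ground truth caption
candidate = "return index of first element that matches"       # hypothetical model prediction

ref_tokens = reference.split()
cand_tokens = candidate.split()

# Smoothed BLEU-4: sentence-level BLEU with equal 1- to 4-gram weights and one of
# NLTK's standard smoothing functions (method4 here is an assumed choice) to avoid
# zero scores on short summaries.
bleu4 = sentence_bleu(
    [ref_tokens], cand_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method4,
)

# METEOR: recent NLTK versions expect pre-tokenized references and hypothesis.
meteor = meteor_score([ref_tokens], cand_tokens)

# ROUGE-L: longest-common-subsequence F1 from the rouge-score package.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}  ROUGE-L: {rouge_l:.3f}")
```

In practice such scores are averaged over the full test set of prediction/reference pairs; the exact corpus-level aggregation used in the paper may differ from this per-example sketch.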