Automatic code generation, the task of generating new code snippets from existing code or comments, has long been of interest. Numerous code generation models have been proposed and evaluated on various benchmark datasets. However, little is known about whether this objective has truly been achieved, or how and why code generation models transform code sequences effectively. In other words, can we fully trust these automatic code generation models? Consequently, there is a pressing need to understand the inner logic of code generation models and to investigate their replicability, reliability, and explainability. To bridge these research gaps, we conduct a thorough empirical study of five code generation models on four representative code generation datasets to assess the limits and capabilities of automatic code generation approaches. We further employ advanced explainable AI approaches to highlight the input tokens that contribute significantly to the generated code. Our experiments show that we successfully replicate state-of-the-art code generation approaches. We find that these approaches suffer from severe data duplication and input insensitivity, subtle issues with significant implications. Our explainability analysis reveals that, across various experimental scenarios, code generation models can recognize code grammar and structural information, but cannot capture the key tokens that need to be updated. From these results we distill several lessons and guidelines for future work in this area.
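To make the token-highlighting idea concrete, the following is a minimal sketch of one common explainable-AI technique, input-times-gradient attribution over a sequence-to-sequence code model. The CodeT5 checkpoint, the example snippets, and the attribution heuristic are illustrative assumptions for this sketch, not the paper's exact experimental setup.

```python
# Minimal sketch: input-x-gradient attribution for a seq2seq code generation model.
# The checkpoint name and example snippets below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-small"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

source = "def add(a, b): return a - b"   # input snippet (buggy)
target = "def add(a, b): return a + b"   # reference output snippet

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Embed the source tokens as a leaf tensor so gradients can be taken w.r.t. it.
embeddings = model.get_input_embeddings()(inputs.input_ids).detach()
embeddings.requires_grad_(True)

outputs = model(
    inputs_embeds=embeddings,
    attention_mask=inputs.attention_mask,
    labels=labels,
)
outputs.loss.backward()

# Input-x-gradient score per source token; larger magnitude suggests a
# larger contribution of that token to the generated (reference) code.
scores = (embeddings.grad * embeddings).sum(dim=-1).squeeze(0).abs()
for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs.input_ids[0]), scores):
    print(f"{tok:>12s}  {score.item():.4f}")
```

Inspecting which source tokens receive high scores is one way to check whether a model attends to the tokens that actually need to change (here, the `-` operator) rather than only to surrounding structure.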