Code generation models can benefit data scientists' productivity by automatically generating code from context and text descriptions. An important measure of modeling progress is whether a model can generate code that executes correctly and solves the intended task. However, due to the lack of an evaluation dataset that directly supports execution-based model evaluation, existing work relies on code surface-form similarity metrics (e.g., BLEU, CodeBLEU) for model selection, which can be inaccurate. To remedy this, we introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks. ExeDS contains 534 problems from Jupyter Notebooks, each consisting of a code context, a task description, a reference program, and the desired execution output. With ExeDS, we evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores. Our experiments show that models with high surface-form scores do not necessarily perform well on execution metrics, and that execution-based metrics better capture model code generation errors. Source code and data are available at https://github.com/Jun-jie-Huang/ExeDS.
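To make the execution-based evaluation concrete, the following is a minimal sketch (not the official ExeDS harness) of how a generated cell could be judged: run the notebook context plus the candidate code, capture the printed output, and compare it with the desired execution output. The function name `execute_and_check` and the toy example are hypothetical.

```python
# Minimal sketch (assumption: evaluation by matching captured stdout against
# the reference execution output; not the paper's official harness).
import contextlib
import io


def execute_and_check(context_code: str, candidate_code: str, desired_output: str) -> bool:
    """Run context + candidate in a fresh namespace; return True if the
    printed output matches the desired execution output."""
    namespace = {}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(context_code, namespace)    # notebook cells preceding the target cell
            exec(candidate_code, namespace)  # model-generated target cell
    except Exception:
        return False  # an execution error counts as a failure
    return buffer.getvalue().strip() == desired_output.strip()


# Hypothetical toy usage
context = "import statistics\nvalues = [1, 2, 3, 4]"
candidate = "print(statistics.mean(values))"
print(execute_and_check(context, candidate, "2.5"))  # True
```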