Code generation models can benefit data scientists' productivity by automatically generating code from context and text descriptions. An important measure of modeling progress is whether a model can generate code that executes correctly and solves the intended task. However, due to the lack of an evaluation dataset that directly supports execution-based model evaluation, existing work relies on code surface-form similarity metrics (e.g., BLEU, CodeBLEU) for model selection, which can be inaccurate. To remedy this, we introduce ExeDS, an evaluation dataset for execution-based evaluation of data science code generation tasks. ExeDS contains 534 problems from Jupyter Notebooks, each consisting of a code context, a task description, a reference program, and the desired execution output. With ExeDS, we evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores. Our experiments show that models with high surface-form scores do not necessarily perform well on execution metrics, and that execution-based metrics better capture model code generation errors. Source code and data are available at https://github.com/Jun-jie-Huang/ExeDS.
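To make the execution-based evaluation concrete, the following is a minimal sketch (not the official ExeDS harness) of how a generated cell could be judged: run the notebook context plus the candidate code, capture the printed output, and compare it with the desired execution output. The function name `execute_and_check` and the toy example are hypothetical.

```python
# Minimal sketch (assumption: evaluation by matching captured stdout against
# the reference execution output; not the paper's official harness).
import contextlib
import io


def execute_and_check(context_code: str, candidate_code: str, desired_output: str) -> bool:
    """Run context + candidate in a fresh namespace; return True if the
    printed output matches the desired execution output."""
    namespace = {}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(context_code, namespace)    # notebook cells preceding the target cell
            exec(candidate_code, namespace)  # model-generated target cell
    except Exception:
        return False  # an execution error counts as a failure
    return buffer.getvalue().strip() == desired_output.strip()


# Hypothetical toy usage
context = "import statistics\nvalues = [1, 2, 3, 4]"
candidate = "print(statistics.mean(values))"
print(execute_and_check(context, candidate, "2.5"))  # True
```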