Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists for data wrangling and analytics tasks. To measure the performance of AI pair programmers that automatically synthesize programs for such tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1,082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems drawn from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states, as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies that elicit code with step-by-step decomposition and NL explanations, showing their potential to improve the diversity and explainability of model predictions.
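To make the task format concrete, the following is a minimal, hypothetical sketch of the kind of multi-turn NL-to-code problem described above; the data, intents, and variable names are invented for illustration and are not drawn from the benchmark itself.

    import pandas as pd

    # Context cell already present in the notebook: a small DataFrame
    # the model must reason about.
    df = pd.DataFrame({
        "city": ["Tokyo", "Delhi", "Shanghai", "Tokyo"],
        "year": [2020, 2020, 2021, 2021],
        "population_m": [37.4, 30.3, 27.8, 37.3],
    })

    # Turn 1 intent: "What is the average population of each city?"
    avg_pop = df.groupby("city")["population_m"].mean()

    # Turn 2 intent, which builds on the result of turn 1:
    # "Keep only the cities whose average is above 30 million."
    large_cities = avg_pop[avg_pop > 30]

Note how the second turn can only be answered by tracking the variable produced in the first turn together with the schema of df, which is exactly the rich multi-modal context the benchmark stresses.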
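The few-shot prompting strategy mentioned above elicits code interleaved with step-by-step NL explanations. A minimal sketch of what such a decomposed prediction might look like follows; the intent and data are invented for illustration.

    import pandas as pd

    df = pd.DataFrame({"product": ["A", "B", "A"], "sales": [10, 25, 12]})

    # Intent: "Which product has the highest total sales?"
    # Step 1: aggregate sales per product.
    totals = df.groupby("product")["sales"].sum()
    # Step 2: take the product whose aggregate is largest.
    best_product = totals.idxmax()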