We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect. We achieve this with multi-criteria metrics, checking both functional correctness, by running test cases, and surface-form constraints, by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so that they differ from their original StackOverflow sources; consequently, models cannot answer them correctly by memorizing solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
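To make the multi-criteria evaluation concrete, the following is a minimal sketch, not DS-1000's actual harness: a prediction is accepted only if it both passes functional test cases and satisfies surface-form constraints on the source text. The problem format, the function name solve, and the helpers here are hypothetical illustrations.

# Minimal sketch of multi-criteria evaluation (hypothetical, not the DS-1000 harness).

def passes_functional_tests(code: str, test_cases) -> bool:
    """Execute the candidate solution and compare its outputs on each test case."""
    for inputs, expected in test_cases:
        namespace = {}
        try:
            exec(code, namespace)                    # define the candidate's function
            result = namespace["solve"](*inputs)     # 'solve' is an assumed entry point
        except Exception:
            return False
        if result != expected:
            return False
    return True

def passes_surface_constraints(code: str, required=(), forbidden=()) -> bool:
    """Check keyword/API constraints on the raw source text."""
    return all(tok in code for tok in required) and not any(
        tok in code for tok in forbidden
    )

# Example: a problem requiring NumPy vectorization, so "np." must appear
# and explicit Python loops are disallowed.
candidate = (
    "import numpy as np\n"
    "def solve(xs): return np.square(np.array(xs)).tolist()"
)
tests = [(([1, 2, 3],), [1, 4, 9])]
accepted = passes_functional_tests(candidate, tests) and passes_surface_constraints(
    candidate, required=("np.",), forbidden=("for ", "while ")
)
print(accepted)  # True

The surface-form check is what makes the evaluation "specific": a solution that produces the right output with a forbidden construct (e.g., a plain Python loop where a vectorized API is required) is still rejected.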