To extend the scope of coding queries to more realistic settings, we propose ODEX, the first open-domain, execution-based natural language (NL) to code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries, and are then carefully rephrased to ensure intent clarity and prevent potential data memorization. Moreover, ODEX supports intents in four natural languages: English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences between top-performing Code LMs: Codex performs better on open-domain queries, yet CodeGen strikes a better balance between open- and closed-domain queries. ODEX corroborates the merits of execution-based evaluation over metrics without execution, but also unveils their complementary effects. Even powerful models such as CodeGen-6B achieve only an 11.96 pass rate at top-1 prediction, suggesting plenty of headroom for improvement. We release ODEX to facilitate research into open-domain problems for the code generation community.
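To make the notion of execution-based evaluation concrete, the following is a minimal sketch of how a pass rate at top-1 prediction could be computed: each model prediction is executed together with its assertion-based test cases, and a problem counts as solved only if the first prediction passes all of them. The function names (`passes_tests`, `top1_pass_rate`), the dictionary layout, and the toy problems are hypothetical and chosen for illustration; they are not the ODEX evaluation harness, which would also need sandboxing and timeouts.

```python
from typing import Dict, List


def passes_tests(candidate: str, test_code: str) -> bool:
    """Return True if the candidate runs and every assertion in test_code holds.

    Hypothetical helper: a real harness would isolate execution in a sandbox.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(candidate, namespace)  # define the predicted function
        exec(test_code, namespace)  # run the assertion-based test cases
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def top1_pass_rate(problems: List[dict]) -> float:
    """Fraction of problems whose first (top-1) prediction passes its tests."""
    if not problems:
        return 0.0
    passed = sum(passes_tests(p["predictions"][0], p["tests"]) for p in problems)
    return passed / len(problems)


# Toy problems (illustrative only, not taken from ODEX).
toy_problems = [
    {
        "predictions": ["def add(a, b):\n    return a + b"],
        "tests": "assert add(2, 3) == 5",
    },
    {
        # Wrong on purpose: lowercases instead of uppercasing.
        "predictions": ["def upper(s):\n    return s.lower()"],
        "tests": "assert upper('ab') == 'AB'",
    },
]

rate = top1_pass_rate(toy_problems)
print(rate)
```

Unlike surface-form metrics that compare a prediction with a reference solution token by token, this check accepts any prediction whose behavior satisfies the tests, which is the property the execution-based setting relies on.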