Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3), derived from the Python Programming Puzzles (Schuster et al. 2021) benchmark, that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease on high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
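For readers unfamiliar with the metric, $pass@k$ is commonly computed with the unbiased estimator introduced with the HumanEval benchmark (Chen et al. 2021): given $n$ sampled programs for a problem, of which $c$ pass every test case, the estimate is $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below is illustrative only. It assumes this standard estimator is the one in use, which the abstract does not state, and the function name `pass_at_k` is hypothetical rather than part of BabelCode.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (standard unbiased estimator).

    n: number of programs sampled for the problem
    c: number of those programs that pass all test cases
    k: sampling budget being evaluated

    Assumption: this is the common estimator, not necessarily the
    exact computation used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(10, 3), (10, 0)], k=1)`, where each pair is hypothetical sample data giving the number of samples and the number of passing samples for one problem.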