Despite the increasing popularity of language models for code generation, it remains unknown how training on bimodal coding forums affects a model's code generation performance and reliability. We therefore collect a dataset of over 2.2M StackOverflow questions with answers for finetuning. Models finetuned on this data achieve average $pass@k$ improvements of 54.64% and 85.35% on the HumanEval (Chen et al., 2021) and Mostly Basic Programming Problems (Austin et al., 2021) tasks, respectively. This regime further decreases the number of generated programs with both syntax and runtime errors. However, we find that at higher temperatures the model's ability to generate runnable programs drops significantly despite higher $pass@k$ scores, underscoring the need for better methods of incorporating such data that mitigate these side effects. The code can be found at https://github.com/gabeorlanski/bimodalcode-generation
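For context, $pass@k$ (Chen et al., 2021) is the probability that at least one of $k$ sampled programs for a problem passes all of its unit tests. Using the standard unbiased estimator from that work, with $n \geq k$ samples drawn per problem of which $c$ pass, it is computed as

$$pass@k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

averaged over all problems in the benchmark.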