Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic?

Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently attracted attention as code assistants, which automatically write programs in a given programming language from a natural-language description of a programming task. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, which prevents them from being used optimally. In this paper, we investigate the various input parameters of two language models and conduct a study to understand whether variations of these input parameters (e.g., the programming task description and its surrounding context, the creativity of the language model, the number of generated solutions) can have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them to two code assistants (Copilot and Codex) and two benchmarks representing algorithmic problems (HumanEval and LeetCode). Our results show that varying the input parameters can significantly improve the performance of language models. However, the temperature, the prompt, and the number of generated solutions are tightly interdependent, which can make it hard for developers to control the parameters properly and obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance.
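To make the experimental setup concrete, the following is a minimal sketch of a sweep over the three parameters named above (prompt variant, temperature, number of generated solutions). It is an illustration under stated assumptions, not the authors' tooling: `fake_generate` is a hypothetical stand-in for a code assistant (it does not call Copilot or Codex), its behaviour is an arbitrary toy and not a finding of the paper, and the prompt variants, the `add` task, and the single test case are invented for the example.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

CORRECT = "def add(a, b):\n    return a + b\n"
WRONG = "def add(a, b):\n    return a - b\n"


def fake_generate(prompt: str, temperature: float, n: int) -> List[str]:
    """Hypothetical stand-in for a code assistant (NOT a real Copilot/Codex API).

    Returns n candidate programs. The mix of correct and wrong candidates is
    an arbitrary, seeded toy behaviour and does not reflect any measured result.
    """
    rng = random.Random(f"{prompt}|{temperature}|{n}")
    return [CORRECT if rng.random() < 0.5 else WRONG for _ in range(n)]


def passes_tests(program: str) -> bool:
    """Run a candidate and check it against one invented test case."""
    namespace: Dict[str, object] = {}
    try:
        exec(program, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def sweep(
    prompts: Dict[str, str],
    temperatures: List[float],
    sample_counts: List[int],
    generate: Callable[[str, float, int], List[str]] = fake_generate,
) -> Dict[Tuple[str, float, int], bool]:
    """Try every (prompt variant, temperature, n) combination.

    A configuration counts as solved if at least one of its n candidates
    passes the test.
    """
    results: Dict[Tuple[str, float, int], bool] = {}
    for (name, prompt), temperature, n in itertools.product(
        prompts.items(), temperatures, sample_counts
    ):
        candidates = generate(prompt, temperature, n)
        results[(name, temperature, n)] = any(passes_tests(c) for c in candidates)
    return results


# Invented prompt variants for a single toy task.
PROMPTS = {
    "original": "Write a function add(a, b) that returns the sum of a and b.",
    "with_context": "# utils.py\nWrite a function add(a, b) that returns the sum of a and b.",
}
TEMPERATURES = [0.0, 0.4, 0.8]
SAMPLE_COUNTS = [1, 10]

results = sweep(PROMPTS, TEMPERATURES, SAMPLE_COUNTS)
for config, solved in sorted(results.items()):
    print(config, "solved" if solved else "unsolved")
```

Because every combination is evaluated, the grid grows multiplicatively with each parameter (here 2 x 3 x 2 = 12 configurations for one task), which gives a sense of why manually tuning interdependent parameters can be hard for developers.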
