Large language models (LMs) are increasingly pretrained on massive corpora of open-source programs and applied to solve program synthesis tasks. However, a fundamental limitation of LMs is that they are unaware of security and vulnerability during both pretraining and inference. As a result, whether an LM produces a secure or a vulnerable program is highly uncertain (e.g., roughly a 60% chance of secure and a 40% chance of vulnerable output for GitHub Copilot, according to a recent study). This greatly impairs LMs' usability, especially in security-sensitive scenarios. To address this limitation, this work formulates a new problem called controlled code generation, which allows users to input a boolean property into an LM to control whether the LM generates secure or vulnerable code. We propose svGen, an effective and lightweight learning approach for solving controlled code generation. svGen leverages property-specific continuous vectors to steer program generation toward the given property, without altering the weights of the LM. svGen's training optimizes those continuous vectors by carefully applying specialized loss terms on different regions of code. Our extensive evaluation shows that svGen achieves strong control capability across various software vulnerabilities and LMs of different parameter sizes. For example, on 9 dangerous vulnerabilities, a state-of-the-art CodeGen LM with 2.7B parameters generates secure programs with a 57% chance. When we use svGen to control the LM to generate secure (resp., vulnerable) programs, the chance is significantly increased to 82% (resp., decreased to 35%).
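To make the idea of steering a frozen model with property-specific continuous vectors concrete, the following is a minimal toy sketch, not svGen's actual implementation. Everything in it is an assumption made for illustration: the two-token vocabulary, the 2-dimensional hidden state, the `FROZEN_W` matrix, the `CONTEXT` vector, and the choice to add the learned vector to the hidden state. It also swaps the specialized, region-specific loss terms mentioned in the abstract for plain cross-entropy on a single next token. What it does show is the general pattern: one vector is learned per property ("sec" or "vul") by gradient descent while the model weights are never updated.

```python
import math

# Toy stand-in for a frozen LM (all values are made up for illustration):
# a fixed matrix maps a 2-d hidden state to logits over two hypothetical tokens.
VOCAB = ["safe_call", "unsafe_call"]
FROZEN_W = [[0.3, -0.2],   # row producing the logit for "safe_call"
            [-0.1, 0.4]]   # row producing the logit for "unsafe_call"
CONTEXT = [1.0, 0.5]       # hidden state of some prompt (hypothetical)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(prefix):
    """Run the frozen model, conditioned on a continuous steering vector."""
    hidden = [c + p for c, p in zip(CONTEXT, prefix)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in FROZEN_W]
    return softmax(logits)


def train_prefix(target_index, steps=200, lr=0.5):
    """Minimize cross-entropy on the target token, updating only the prefix."""
    prefix = [0.0, 0.0]
    for _ in range(steps):
        probs = next_token_probs(prefix)
        # Gradient of cross-entropy w.r.t. each logit: probs[k] - 1[k == target].
        dlogits = [p - (1.0 if k == target_index else 0.0)
                   for k, p in enumerate(probs)]
        # Chain rule through the frozen matrix: d loss / d prefix[j].
        grad = [sum(dlogits[k] * FROZEN_W[k][j] for k in range(len(VOCAB)))
                for j in range(len(prefix))]
        prefix = [p - lr * g for p, g in zip(prefix, grad)]
    return prefix


weights_before = [row[:] for row in FROZEN_W]

# One learned vector per property value.
prefixes = {"sec": train_prefix(0), "vul": train_prefix(1)}

baseline = next_token_probs([0.0, 0.0])
p_sec = next_token_probs(prefixes["sec"])
p_vul = next_token_probs(prefixes["vul"])

print("P(safe_call) without steering:", round(baseline[0], 3))
print("P(safe_call) with 'sec' vector:", round(p_sec[0], 3))
print("P(safe_call) with 'vul' vector:", round(p_vul[0], 3))
print("model weights unchanged:", FROZEN_W == weights_before)
```

At inference time the user's boolean property would simply select which entry of `prefixes` is passed to the model. In a real LM the continuous vectors would more plausibly be injected into the network's internal states than added to a single hidden vector, which this sketch does not attempt to model.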