Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work we consider how LLMs may be leveraged to automatically repair security-relevant bugs present in hardware designs. We focus on bug repair in code written in the Hardware Description Language Verilog. For this study we build a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identification of the best parameters for the LLM. We show that an ensemble of LLMs can repair all ten of our benchmarks. This ensemble outperforms the state-of-the-art CirFix hardware bug repair tool on its own suite of bugs. These results show that LLMs can repair hardware security bugs and that the framework is an important step towards the ultimate goal of an automated end-to-end bug repair framework.
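To make the "design space exploration of prompts and parameters" concrete, the sketch below illustrates the kind of sweep such a framework might run: combining prompt variants with LLM sampling parameters and scoring each candidate repair against a testbench. It is a minimal illustration only; the prompt templates, parameter grid, and the helpers query_llm and passes_testbench are assumed placeholders, not the paper's actual implementation or API.

```python
# Hypothetical sketch of a prompt/parameter sweep for LLM-based Verilog bug repair.
# query_llm() and passes_testbench() are assumed stand-ins for a code LLM call
# (e.g., Codex) and a simulation-based security check; they are not from the paper.
from itertools import product

PROMPT_VARIANTS = {
    "bug_comment_only": "// BUG: security flaw in the code below\n{buggy_code}\n// Fixed version:\n",
    "with_cwe_hint":    "// Repair the bug below (e.g., improper lock-state handling)\n{buggy_code}\n// Fixed version:\n",
}
TEMPERATURES = [0.1, 0.3, 0.5, 0.7]
TOP_PS = [1.0]

def query_llm(prompt: str, temperature: float, top_p: float) -> str:
    """Placeholder: send the prompt to a code LLM and return one candidate completion."""
    raise NotImplementedError

def passes_testbench(candidate_design: str) -> bool:
    """Placeholder: simulate the candidate design against the security testbench."""
    raise NotImplementedError

def sweep(buggy_code: str):
    """Enumerate prompt/parameter combinations and record which ones yield a passing repair."""
    results = []
    for (variant, template), temp, top_p in product(PROMPT_VARIANTS.items(), TEMPERATURES, TOP_PS):
        completion = query_llm(template.format(buggy_code=buggy_code), temp, top_p)
        repaired = passes_testbench(completion)  # how the completion is spliced back is framework-specific
        results.append({"prompt": variant, "temperature": temp, "top_p": top_p, "repaired": repaired})
    return results
```

A sweep of this shape is what allows the evaluation to be quantitative: each (prompt, parameter) point yields a repair rate, and the best-performing configurations can then be compared across LLMs or combined into an ensemble.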