Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially to the automatic generation of hardware description languages (HDLs) such as Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computational cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog-generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data-synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbenches, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for a cold start of reasoning abilities, followed by adaptive DAPO, a novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing the prior state of the art by 12–20% and even exceeding the performance of the 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in the EDA and LLM communities.
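The round-trip data-synthesis step can be illustrated with a minimal sketch. The helper callables below (an LLM summarizer, an LLM code generator, and a testbench-based equivalence check) are hypothetical placeholders supplied by the caller, not the released CodeV-R1 implementation; the sketch only shows the filtering logic, in which a snippet is kept when the Verilog regenerated from its NL description is judged equivalent to the golden reference.

```python
from typing import Callable, Iterable, List, Dict

def round_trip_filter(
    snippets: Iterable[str],
    describe: Callable[[str], str],      # hypothetical: LLM maps Verilog -> NL description
    regenerate: Callable[[str], str],    # hypothetical: LLM maps NL description -> candidate Verilog
    equivalent: Callable[[str, str], bool],  # hypothetical: rule-based testbench equivalence check
) -> List[Dict[str, str]]:
    """Keep only NL-code pairs whose regenerated Verilog is functionally
    equivalent to the golden reference; inequivalent examples are filtered out."""
    dataset = []
    for golden in snippets:
        spec = describe(golden)            # code -> NL description
        candidate = regenerate(spec)       # NL description -> candidate code
        if equivalent(golden, candidate):  # equivalence against the golden reference
            dataset.append({"instruction": spec, "code": golden})
    return dataset
```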