Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improving correctness in LLMs. In many scientific problems, however, the objective is not necessarily to produce the correct answer but to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation that improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer that counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50\% greater rate than prior methods, and it conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
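The link between temperature and entropy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how PLaID++ applies temperature scaling, so this is not the authors' implementation; it only shows the general effect that dividing logits by a larger temperature flattens the softmax distribution and raises its Shannon entropy. The logits and the helper names `softmax_with_temperature` and `shannon_entropy` are hypothetical, chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits (assumption, not taken from the paper).
logits = [2.0, 1.0, 0.5, -1.0]

# Entropy of the sampling distribution at several temperatures.
entropies = {
    t: shannon_entropy(softmax_with_temperature(logits, t))
    for t in (0.5, 1.0, 2.0)
}

for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} nats")
```

For any non-uniform logits, entropy increases with temperature and is bounded above by the log of the number of candidates, which is why a higher temperature tends to spread probability mass over more candidates rather than concentrating it on a single mode.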