KL-regularized reinforcement learning (RL) is a popular alignment framework for steering language model responses toward high-reward outcomes. We propose a modular solver for this RL objective, called controlled decoding (CD), which exerts control through a separate prefix scorer module. At training time, the prefix scorer learns a value function for the reward; at inference time, it is used to control generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that a single prefix scorer can learn multiple rewards, and that different reward combinations can be configured at inference time, effectively solving a multi-objective RL problem with no additional training. We show that the benefits of applying CD transfer to an unseen base model with no further tuning. Finally, we show that CD can be applied in a blockwise decoding fashion at inference time, essentially bridging the gap between the popular best-of-$n$ strategy and token-level control through reinforcement learning. This makes CD a promising approach for alignment of language models.
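To make the decoding-time mechanism concrete, below is a minimal sketch (not the paper's implementation) of one token-level controlled decoding step. It assumes a frozen HuggingFace-style causal LM `base_model` and a hypothetical `prefix_scorer` that returns a value estimate for every candidate next token; the KL-regularized optimal policy is then proportional to the base distribution tilted by $\exp(V/\beta)$.

```python
import torch
import torch.nn.functional as F

def controlled_decoding_step(base_model, prefix_scorer, input_ids, beta=1.0):
    """One token-level controlled decoding step (illustrative sketch).

    Assumptions (not from the source): `base_model(input_ids).logits` gives
    next-token logits of the frozen base policy, and `prefix_scorer(input_ids)`
    returns per-token value estimates V(x, [y_<t, z]) of shape [batch, vocab].
    """
    with torch.no_grad():
        # Next-token logits of the frozen base language model.
        base_logits = base_model(input_ids).logits[:, -1, :]
        # Learned value estimates for each candidate continuation token.
        values = prefix_scorer(input_ids)
    # Tilt the base distribution: pi(z) ∝ pi_base(z) * exp(V(z) / beta),
    # where beta trades off reward against KL divergence from the base model.
    controlled_logits = base_logits + values / beta
    probs = F.softmax(controlled_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    return next_token
```

A blockwise variant would instead sample several candidate continuations of a fixed length from the base model and use the prefix scorer to select among them, which is how the abstract's connection to best-of-$n$ arises.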