Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
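
The adjudicator idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`adjudicate`, `utilitarian`, `egalitarian`, `nash`), the candidate names, and the per-subpopulation utility scores are all hypothetical, chosen only to show how a selection step external to the LLM might pick among candidate reward functions using a user-selected social welfare function.

```python
import math
from typing import Callable, Dict, List

# ASSUMPTION: each LLM-proposed candidate reward function has already been
# evaluated, giving one utility score per subpopulation. The numbers below
# are made up for illustration and do not come from the paper.
CandidateScores = Dict[str, List[float]]
WelfareFn = Callable[[List[float]], float]


def utilitarian(utilities: List[float]) -> float:
    """Sum of utilities: favors the largest total benefit."""
    return sum(utilities)


def egalitarian(utilities: List[float]) -> float:
    """Minimum utility: favors the worst-off subpopulation."""
    return min(utilities)


def nash(utilities: List[float]) -> float:
    """Log of the product of utilities (requires strictly positive values)."""
    return sum(math.log(u) for u in utilities)


def adjudicate(candidates: CandidateScores, welfare: WelfareFn) -> str:
    """Return the name of the candidate that maximizes the chosen welfare function."""
    return max(candidates, key=lambda name: welfare(candidates[name]))


# Hypothetical candidates scored on three subpopulations.
candidates: CandidateScores = {
    "reward_a": [0.9, 0.2, 0.7],
    "reward_b": [0.6, 0.5, 0.6],
    "reward_c": [0.8, 0.4, 0.3],
}

best_total = adjudicate(candidates, utilitarian)
best_worst_off = adjudicate(candidates, egalitarian)
best_nash = adjudicate(candidates, nash)
print(best_total, best_worst_off, best_nash)
```

In this toy example the utilitarian rule and the egalitarian rule select different candidates, which is the kind of tradeoff a configurable welfare function lets the user control explicitly rather than leaving it to the LLM.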
