We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.