To learn directed behaviors in complex environments, intelligent agents need to optimize objective functions. Various objectives are known for designing artificial agents, including task rewards and intrinsic motivation. However, it is unclear how the known objectives relate to each other, which objectives remain yet to be discovered, and which objectives better describe the behavior of humans. We introduce the Action Perception Divergence (APD), an approach for categorizing the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typical in reinforcement learning, the general objectives maximize information with the environment through latent variable models of input sequences. Intuitively, these agents use perception to align their beliefs with the world and use actions to align the world with their beliefs. They infer representations that are informative of past inputs, explore future inputs that are informative of their representations, and select actions or skills that maximally influence future inputs. This explains a wide range of unsupervised objectives from a single principle, including representation learning, information gain, empowerment, and skill discovery. Our findings suggest leveraging powerful world models for unsupervised exploration as a path toward highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.
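As a minimal sketch (notation ours, not the paper's exact formulation): writing inputs as $x_{1:T}$ and the latent variables of the agent's model as $z$, the general end of the spectrum can be read as maximizing the mutual information between inputs and latents, which by the chain rule splits into a past (perception) term and a future (exploration and control) term:

$$
I(x_{1:T}; z) \;=\; \underbrace{I(x_{\le t}; z)}_{\text{representations informative of past inputs}} \;+\; \underbrace{I(x_{>t}; z \mid x_{\le t})}_{\text{future inputs informative of representations}} .
$$

The first term corresponds to representation learning from past inputs, while the second covers information gain, empowerment, and skill discovery, depending on whether $z$ denotes beliefs, actions, or skills; this is a hedged reading of the abstract's intuition rather than a definitive statement of the APD objective.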