Biological agents have meaningful interactions with their environment despite the absence of immediate reward signals. In such instances, the agent can learn preferred modes of behaviour that lead to predictable states -- necessary for survival. In this paper, we pursue the notion that this learnt behaviour can be a consequence of reward-free preference learning that ensures an appropriate trade-off between exploration and preference satisfaction. For this, we introduce a model-based Bayesian agent equipped with a preference learning mechanism (pepper) using conjugate priors. These conjugate priors are used to augment the expected free energy planner for learning preferences over states (or outcomes) across time. Importantly, our approach enables the agent to learn preferences that encourage adaptive behaviour at test time. We illustrate this in the OpenAI Gym FrozenLake and the 3D mini-world environments -- with and without volatility. Given a constant environment, these agents learn confident (i.e., precise) preferences and act to satisfy them. Conversely, in a volatile setting, perpetual preference uncertainty maintains exploratory behaviour. Our experiments suggest that learnable (reward-free) preferences entail a trade-off between exploration and preference satisfaction. Pepper offers a straightforward framework for designing adaptive agents when reward functions cannot be predefined, as is often the case in real environments.
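The conjugate-prior preference learning described above can be illustrated with a minimal sketch. This is not the authors' implementation: the Dirichlet parameterisation over discrete states, the `decay` factor standing in for volatility-driven forgetting, and the use of posterior-mean log-probabilities as the preference (C) vector for an expected-free-energy planner are all illustrative assumptions.

```python
import numpy as np

def update_preferences(alpha, state_counts, decay=1.0):
    """Conjugate (Dirichlet) update of preference pseudo-counts.

    alpha        -- current Dirichlet concentration over discrete states
    state_counts -- state-visit counts accumulated since the last update
    decay        -- 1.0 retains all past evidence (stable environment);
                    values < 1.0 forget old evidence, keeping preference
                    uncertainty high under volatility (assumed mechanism)
    """
    return decay * alpha + state_counts

def preference_precision(alpha):
    # Total concentration: larger values mean more confident
    # (i.e., precise) preferences.
    return alpha.sum()

def log_preferences(alpha):
    # Posterior-mean preferences, log-transformed for use as the
    # C vector in an expected-free-energy planner (hypothetical
    # interface, not the paper's exact quantity).
    p = alpha / alpha.sum()
    return np.log(p)

# Stable environment: repeated visits to state 0 sharpen preferences.
alpha = np.ones(4)                                    # flat prior
alpha = update_preferences(alpha, np.array([5.0, 0, 0, 0]))

# Volatile environment: decay < 1 erodes accumulated confidence,
# so the agent's preferences stay uncertain and exploration persists.
volatile = update_preferences(np.ones(4) * 10.0, np.zeros(4), decay=0.5)
```

Under this sketch, a constant environment drives `preference_precision` up and the agent increasingly acts to satisfy the dominant preference, while decay in a volatile setting caps precision and preserves exploratory behaviour.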