Exploration and preference satisfaction trade-off in reward-free learning

Biological agents interact meaningfully with their environment even in the absence of a reward signal. In such instances, the agent can learn preferred modes of behaviour that lead to predictable states, which are necessary for survival. In this paper, we pursue the notion that this learnt behaviour can be a consequence of reward-free preference learning that ensures an appropriate trade-off between exploration and preference satisfaction. To this end, we introduce a model-based Bayesian agent equipped with a preference learning mechanism (Pepper) that uses conjugate priors. These conjugate priors augment the expected free energy planner so that the agent learns preferences over states (or outcomes) across time. Importantly, our approach enables the agent to learn preferences that encourage adaptive behaviour at test time. We illustrate this in the OpenAI Gym FrozenLake and 3D mini-world environments, with and without volatility. In a constant environment, these agents learn confident (i.e., precise) preferences and act to satisfy them. Conversely, in a volatile setting, perpetual preference uncertainty maintains exploratory behaviour. Our experiments suggest that learnable (reward-free) preferences entail a trade-off between exploration and preference satisfaction. Pepper offers a straightforward framework suitable for designing adaptive agents when reward functions cannot be predefined, as in real environments.
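
The link between conjugate priors and preference precision can be illustrated with a minimal sketch. It assumes a Dirichlet prior over a categorical distribution of outcomes, which is one common conjugate pair; the abstract does not state which prior is used. All function names and the four-outcome setup are hypothetical, and the expected free energy planner is left out entirely.

```python
import math

def update_dirichlet(alpha, outcome, count=1.0):
    """Conjugate update: add a pseudo-count to the observed outcome (assumed Dirichlet-categorical pair)."""
    new_alpha = list(alpha)
    new_alpha[outcome] += count
    return new_alpha

def expected_preferences(alpha):
    """Posterior mean of the categorical preference distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]

def entropy(p):
    """Shannon entropy in nats; lower values mean more precise preferences."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Constant environment (illustrative): the agent always ends up observing outcome 2.
alpha_const = [1.0] * 4
for _ in range(50):
    alpha_const = update_dirichlet(alpha_const, 2)

# Volatile environment (illustrative): the observed outcome keeps changing.
alpha_vol = [1.0] * 4
for t in range(50):
    alpha_vol = update_dirichlet(alpha_vol, t % 4)

h_const = entropy(expected_preferences(alpha_const))
h_vol = entropy(expected_preferences(alpha_vol))
print(f"constant: {h_const:.3f} nats, volatile: {h_vol:.3f} nats")
```

In this toy setting the constant stream concentrates the pseudo-counts on one outcome, so the learnt preferences become precise, whereas the changing stream keeps them close to uniform. This mirrors, in a much simplified form, the contrast between preference satisfaction and sustained exploration described in the abstract.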
