We revisit the role of instrumental value as a driver of adaptive behavior. In active inference, instrumental or extrinsic value is quantified by the information-theoretic surprisal of a set of observations, measuring the extent to which those observations conform to prior beliefs or preferences. That is, an agent is expected to seek the type of evidence that is consistent with its own model of the world. For reinforcement learning tasks, the distribution of preferences replaces the notion of reward. We explore a scenario in which the agent learns this distribution in a self-supervised manner. In particular, we highlight the distinction between observations induced by the environment and those pertaining more directly to the continuity of an agent in time. We evaluate our methodology in a dynamic environment with discrete time and actions, first with a surprisal-minimizing model-free agent (in the RL sense) and then extending to the model-based case, in which the agent minimizes the expected free energy.
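As a minimal sketch of the quantity referenced above (the notation $\tilde{p}$, $o$, and $\pi$ is assumed here rather than taken from the text): writing $\tilde{p}(o)$ for the agent's learned preference distribution over observations, the surprisal of an observation $o$ is
$$\mathcal{S}(o) = -\ln \tilde{p}(o),$$
and the instrumental (extrinsic) term of the expected free energy under a policy $\pi$ is the expected surprisal of predicted outcomes,
$$\mathbb{E}_{q(o \mid \pi)}\!\left[-\ln \tilde{p}(o)\right],$$
so that minimizing this term drives the agent toward observations consistent with its preferences.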