Reward is not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning

We introduce a physiological model-based agent as a proof-of-principle that it is possible to define a flexible self-preserving system that does not use a reward signal or reward-maximization as an objective. We achieve this by introducing the Self-Preserving Agent (SPA), which has a physiological structure in which the system can get trapped in an absorbing state if the agent does not solve and execute goal-directed policies. Our agent is defined using a new class of Bellman equations called Operator Bellman Equations (OBEs), which encode jointly non-stationary, non-Markovian tasks formalized as a Temporal Goal Markov Decision Process (TGMDP). OBEs produce optimal goal-conditioned spatiotemporal transition operators that map an initial state-time to the final state-times of a policy used to complete a goal, and can also be used to forecast future states in multiple dynamic physiological state-spaces. SPA is equipped with an intrinsic motivation function called the valence function, which quantifies the change in empowerment (the channel capacity of a transition operator) after following a policy. Because empowerment is a function of a transition operator, there is a natural synergy between empowerment and OBEs: the OBEs create hierarchical transition operators, and the valence function can evaluate hierarchical empowerment change defined on these operators. The valence function can then be used for goal selection, wherein the agent chooses a policy sequence that realizes the goal states that produce the maximum empowerment gain. In doing so, the agent will seek freedom and avoid internal death-states that undermine its ability to control both external and internal states in the future, thereby exhibiting the capacity for predictive and anticipatory self-preservation. We also compare SPA to Multi-objective RL, and discuss its capacity for symbolic reasoning and life-long learning.
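To make the relationship between empowerment and the valence function concrete, the following is a minimal sketch and not the paper's implementation. It assumes a transition operator can be written as a small row-stochastic matrix whose rows are the policies available to the agent and whose columns are final states. It estimates channel capacity with the standard Blahut-Arimoto iteration, which is a swapped-in technique that the abstract does not name. The function names `channel_capacity`, `valence`, and `select_goal`, and the toy operators, are all hypothetical.

```python
import math


def channel_capacity(op, iters=500, tol=1e-12):
    """Blahut-Arimoto estimate of channel capacity, in bits.

    `op` is a row-stochastic matrix: op[a][s] = Pr(final state s | policy a).
    (Assumed matrix form of a transition operator, for illustration only.)
    """
    n_in, n_out = len(op), len(op[0])
    p = [1.0 / n_in] * n_in  # distribution over policies, initially uniform
    capacity = 0.0
    for _ in range(iters):
        # Marginal over final states induced by the current policy distribution.
        q = [sum(p[a] * op[a][s] for a in range(n_in)) for s in range(n_out)]
        # KL divergence (in nats) of each row from the marginal.
        d = [
            sum(op[a][s] * math.log(op[a][s] / q[s])
                for s in range(n_out) if op[a][s] > 0)
            for a in range(n_in)
        ]
        # Reweight policies toward those with more distinguishable outcomes.
        w = [p[a] * math.exp(d[a]) for a in range(n_in)]
        z = sum(w)
        p = [x / z for x in w]
        new_capacity = math.log(z) / math.log(2)  # lower bound, in bits
        converged = abs(new_capacity - capacity) < tol
        capacity = new_capacity
        if converged:
            break
    return capacity


def valence(op_before, op_after):
    """Change in empowerment from the current operator to the post-policy one."""
    return channel_capacity(op_after) - channel_capacity(op_before)


def select_goal(op_now, candidate_ops):
    """Pick the goal whose resulting operator yields the largest empowerment gain."""
    return max(candidate_ops, key=lambda g: valence(op_now, candidate_ops[g]))


# Hypothetical toy operators (not taken from the paper).
OP_NOW = [[1.0, 0.0], [0.0, 1.0]]  # two policies, two distinguishable outcomes
CANDIDATES = {
    # Reaching this goal opens up four distinguishable outcomes.
    "replenish": [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)],
    # Absorbing death-state: every policy leads to the same outcome.
    "death": [[1.0, 0.0], [1.0, 0.0]],
}

if __name__ == "__main__":
    for goal, op in CANDIDATES.items():
        print(goal, valence(OP_NOW, op))
    print("selected:", select_goal(OP_NOW, CANDIDATES))
```

In this toy setting the absorbing operator has zero capacity, so its valence is negative and `select_goal` prefers the goal that widens the set of reachable outcomes. This mirrors, in a very reduced form, the abstract's claim that maximizing empowerment gain steers the agent away from death-states.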
