Learning a control policy capable of adapting to time-varying and potentially evolving system dynamics has been a great challenge for mainstream reinforcement learning (RL). Chiefly, the ever-changing system properties continuously affect how the RL agent interacts with the state space through its actions, effectively (re-)introducing concept drift into the underlying policy-learning process. We postulate that higher adaptability of the control policy can be achieved by characterizing and representing actions with extra "degrees of freedom," thereby adjusting with greater flexibility to variations in the actions' "behavioral" outcomes, including how these actions are carried out in real time and shifts in the action set itself. This paper proposes a Bayesian-flavored generalized RL framework that first establishes the notion of a parametric action model to better cope with uncertainty and fluid action behaviors, and then introduces the notion of a reinforcement field, a physics-inspired construct built from "polarized experience particles" maintained in the RL agent's working memory. These particles effectively encode the agent's dynamic learning experience, which evolves over time in a self-organizing way. Using the reinforcement field as a substrate, we further generalize the policy search to incorporate high-level decision concepts by viewing the past memory as an implicit graph structure, in which the memory instances, or particles, are interconnected with quantified degrees of associability/similarity, so that the "associative memory" principle can be consistently applied to establish and augment the learning agent's evolving world model.
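To make the particle-based construction more concrete, the following is a minimal Python sketch of one plausible reading of the reinforcement-field idea: experience particles carrying signed ("polarized") reinforcement charges are deposited into memory, and the field value at a query point is a similarity-weighted aggregate of those charges. The class and method names (`ReinforcementField`, `deposit`, `field_value`) and the choice of a Gaussian kernel as the associability measure are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical sketch: a "reinforcement field" as a kernel-weighted estimate
# over stored experience particles. Each particle carries a state-action
# coordinate and a signed ("polarized") reinforcement charge.

class ReinforcementField:
    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.particles = []  # list of (coordinate, charge) pairs

    def deposit(self, coord, charge):
        """Store one experience particle (e.g., charge = observed return)."""
        self.particles.append((np.asarray(coord, dtype=float), float(charge)))

    def similarity(self, a, b):
        """Gaussian kernel as one possible associability/similarity measure."""
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.bandwidth ** 2))

    def field_value(self, coord):
        """Field strength at a query point: similarity-weighted mean charge."""
        coord = np.asarray(coord, dtype=float)
        weights = np.array([self.similarity(coord, c) for c, _ in self.particles])
        charges = np.array([q for _, q in self.particles])
        total = weights.sum()
        return float(weights @ charges / total) if total > 0 else 0.0

field = ReinforcementField(bandwidth=0.3)
field.deposit([0.1, 0.2], +1.0)   # rewarding experience particle
field.deposit([0.9, 0.8], -1.0)   # punishing experience particle
print(field.field_value([0.15, 0.25]))  # pulled toward the positive particle
```

Under this reading, the pairwise `similarity` values play the role of the implicit graph edges among memory instances, so the same kernel that shapes the field also defines the "associative memory" structure over past experience.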