Learning a control policy under time-varying, evolving system dynamics poses a great challenge to mainstream reinforcement learning algorithms. Most standard methods assume that actions form a rigid, fixed set of choices applied sequentially to the state space in a predefined manner. Consequently, without substantial re-learning, the learned policy cannot adapt to variations in the action set or in the actions' "behavioral" outcomes. In addition, the standard action representation and the action-induced state-transition mechanism inherently limit how reinforcement learning can be applied to complex, real-world applications, primarily because the resulting state space is intractably large and there is no facility for generalizing the learned policy to its unknown regions. This paper proposes a Bayesian-flavored generalized reinforcement learning framework that first establishes the notion of a parametric action model to better cope with uncertainty and fluid action behaviors, and then introduces the notion of a reinforcement field, a physics-inspired construct built from "polarized experience particles" maintained in the learning agent's working memory. These particles effectively encode the agent's dynamic learning experience and evolve over time in a self-organizing way. On top of the reinforcement field, we further generalize the policy learning process to incorporate high-level decision concepts by treating past memory as an implicit graph structure in which memory instances (particles) are interconnected through a defined similarity between decisions; the "associative memory" principle can thereby be applied to augment the learning agent's world model.
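As a concrete illustration of the reinforcement-field idea sketched above, the following minimal Python example treats each remembered transition as a signed "experience particle" and evaluates the field at a query state-action point as a kernel-weighted superposition of particle charges. All names (ExperienceParticle, ReinforcementField, absorb, potential, greedy_action) and the Gaussian-kernel choice are illustrative assumptions for exposition, not the paper's actual construction.

```python
import numpy as np

class ExperienceParticle:
    """A 'polarized' experience particle: a remembered (state, action)
    point carrying a signed reinforcement charge (+ reward / - penalty)."""
    def __init__(self, state, action, charge):
        self.state = np.asarray(state, dtype=float)
        self.action = np.asarray(action, dtype=float)
        self.charge = float(charge)

class ReinforcementField:
    """Field-like value estimate induced by particles in working memory."""
    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.particles = []

    def absorb(self, state, action, reward):
        # Working memory grows with experience; a full agent would also
        # prune or merge particles so the memory self-organizes over time.
        self.particles.append(ExperienceParticle(state, action, reward))

    def potential(self, state, action):
        """Field value at (state, action): each particle contributes its
        charge, attenuated by a Gaussian kernel over joint distance."""
        if not self.particles:
            return 0.0
        q = np.concatenate([np.asarray(state, float), np.asarray(action, float)])
        total, weight = 0.0, 0.0
        for p in self.particles:
            d = np.linalg.norm(np.concatenate([p.state, p.action]) - q)
            k = np.exp(-0.5 * (d / self.bandwidth) ** 2)
            total += k * p.charge
            weight += k
        return total / (weight + 1e-12)

    def greedy_action(self, state, candidate_actions):
        """Pick the candidate action with the highest field potential,
        generalizing past experience to unvisited state-action points."""
        return max(candidate_actions, key=lambda a: self.potential(state, a))
```

Because the field is a smooth superposition over stored particles rather than a tabular value function, changing the candidate action set (here, the argument to greedy_action) requires no re-learning, which is the adaptivity property the parametric action model is meant to provide.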