Reinforcement Learning formalises an embodied agent's interaction with the environment through observations, rewards, and actions. But where do the actions come from? Actions are often considered to represent something external, such as the movement of a limb, a chess piece, or, more generally, the output of an actuator. In this work we explore and formalise a contrasting view, namely that actions are best thought of as the output of a sequence of internal choices with respect to an action model. This view is particularly well-suited for leveraging the recent advances in large sequence models as prior knowledge for multi-task reinforcement learning problems. Our main contribution is to show how to augment the standard MDP formalism with a sequential notion of internal action using information-theoretic techniques, and to show that this leads to self-consistent definitions of both internal and external action value functions.
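To fix intuition, consider a minimal illustrative sketch (our notation, not the paper's construction): one natural reading of "a sequence of internal choices with respect to an action model" is an autoregressive factorisation, in which each external action $a$ is emitted as $K$ internal choices $u_1, \dots, u_K$:
\[
\pi(a \mid s) \;=\; \prod_{k=1}^{K} \pi\big(u_k \,\big|\, s,\, u_{1:k-1}\big), \qquad a = (u_1, \dots, u_K).
\]
Under such a factorisation one can speak both of an external action value $Q^{\pi}(s, a)$ in the usual MDP sense and of internal values $Q^{\pi}_k(s, u_{1:k})$ defined over partial sequences of choices, with the requirement that the two agree once all $K$ choices have been made; the self-consistency result stated above concerns definitions of exactly this kind.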