The Markov Decision Process (MDP) provides a mathematical framework for formulating the learning process of agents in reinforcement learning. The MDP is limited by the Markovian assumption that the reward depends only on the current state and action. However, the reward sometimes depends on the history of states and actions, which results in a decision process set in a non-Markovian environment. In such environments, agents receive sparse rewards through temporally extended behaviors, and the learned policies tend to be similar. As a consequence, agents that acquire similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To address this problem, this paper learns diverse policies from the history of state-action pairs in a non-Markovian environment, designing a policy dispersion scheme that seeks diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings effectively enlarge the disagreement across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results show that this dispersion scheme obtains more expressive diverse policies, which in turn deliver more robust performance than recent learning baselines across various learning environments.
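To make the "stack embeddings into a dispersion matrix" step concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the dispersion matrix takes a Gram-matrix form over the stacked policy embeddings and that a log-determinant term serves as the diversity objective, and it replaces the transformer-based history encoder with a toy linear layer purely for illustration. All names and dimensions (encoder, K, H, D) are hypothetical.

```python
# Sketch: stack K policy embeddings, form a Gram-style dispersion matrix,
# check positive definiteness, and use -logdet as a diversity objective.
import torch
import torch.nn as nn

K, H, D = 4, 32, 16          # number of policies, history feature size, embedding size
encoder = nn.Linear(H, D)    # stand-in for the transformer-based policy encoder

histories = torch.randn(K, H)   # one pooled state-action history per policy (toy data)
E = encoder(histories)          # policy embeddings, shape (K, D)

# Dispersion matrix: pairwise inner products of the stacked embeddings.
# A small ridge term keeps it positive definite even for nearly collinear embeddings.
disp = E @ E.T + 1e-4 * torch.eye(K)

# Positive-definiteness check: all eigenvalues strictly positive.
eigvals = torch.linalg.eigvalsh(disp)
assert torch.all(eigvals > 0)

# Maximizing log det(disp) (i.e., minimizing its negative) pushes the embeddings
# apart, enlarging the disagreement across the induced policies.
diversity_loss = -torch.logdet(disp)
diversity_loss.backward()
```

In this reading, the positive-definiteness condition stated in the abstract guarantees that the log-determinant objective is well defined and grows only when the embeddings spread out rather than collapse onto one another.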