Policies for partially observed Markov decision processes can be efficiently learned by imitating policies for the corresponding fully observed Markov decision processes. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and so may encourage actions that are sub-optimal, or even unsafe, under partial information. We derive an objective that instead trains the expert to maximize the expected reward of the imitating agent policy, and use it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and the agent. We show that A2D produces an expert policy that the agent can safely imitate, and that the agent in turn outperforms policies learned by imitating a fixed expert.
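To make the joint-training structure concrete, the following is a minimal toy sketch (Python/NumPy) of an A2D-style loop: trajectories are gathered under a mixture of expert and agent, the fully observed expert is updated with a policy-gradient step on the mixture return, and the partially observed agent imitates the current expert on the visited states. The toy environment, the simplified updates, and all names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_OBS, N_ACTIONS = 8, 3, 2


def observe(state):
    # Partial observation: the agent sees only a coarse feature of the state.
    return state % N_OBS


def env_step(state, action):
    # Toy MDP dynamics and reward (placeholder).
    next_state = (state + action + 1) % N_STATES
    return next_state, 1.0 if next_state == 0 else 0.0


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


# Expert conditions on the full state; agent conditions on the observation only.
expert_logits = np.zeros((N_STATES, N_ACTIONS))
agent_logits = np.zeros((N_OBS, N_ACTIONS))
beta, lr = 0.5, 0.1  # mixture weight and learning rate (placeholders)

for iteration in range(200):
    state, traj = 0, []
    # Roll out the beta-mixture of expert and agent, as in DAgger-style training.
    for t in range(20):
        obs = observe(state)
        probs = beta * softmax(expert_logits[state]) + (1 - beta) * softmax(agent_logits[obs])
        action = rng.choice(N_ACTIONS, p=probs)
        state_next, reward = env_step(state, action)
        traj.append((state, obs, action, reward))
        state = state_next

    # Expert update: a crude policy-gradient step on the mixture return, so the
    # expert is pushed toward behaviour the partially observed agent can reproduce
    # (the paper derives the properly weighted gradient; this is a simplification).
    returns = np.cumsum([r for *_, r in traj][::-1])[::-1]
    for (s, o, a, _), G in zip(traj, returns):
        grad = -softmax(expert_logits[s])
        grad[a] += 1.0
        expert_logits[s] += lr * beta * G * grad

    # Agent update: supervised imitation of the current expert on visited states.
    for s, o, a, _ in traj:
        grad = softmax(expert_logits[s]) - softmax(agent_logits[o])
        agent_logits[o] += lr * grad
```

In this sketch the key departure from fixed-expert imitation is that the expert's own update depends on returns collected under the expert/agent mixture, so the expert adapts toward actions the agent can actually realize from its limited observations.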