In Reinforcement Learning (RL), discrete actions, as opposed to continuous actions, result in less complex exploration problems and allow the immediate computation of the maximum of the action-value function, which is central to dynamic programming-based methods. In this paper, we propose a novel method: Action Quantization from Demonstrations (AQuaDem), which learns a discretization of continuous action spaces by leveraging the prior of demonstrations. This dramatically reduces the exploration problem, since the actions available to the agent are not only finite in number but also plausible in light of the demonstrator's behavior. By discretizing the action space, we can apply any discrete-action deep RL algorithm to the continuous control problem. We evaluate the proposed method on three different setups: RL with demonstrations, RL with play data --demonstrations of a human playing in an environment but not solving any specific task-- and Imitation Learning. For all three setups, we only consider human data, which is more challenging than synthetic data. We find that AQuaDem consistently outperforms state-of-the-art continuous control methods, both in terms of performance and sample efficiency. We provide visualizations and videos on the paper's website: https://google-research.github.io/aquadem.
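The abstract describes a two-stage idea: a quantization step learned from demonstrations, followed by any discrete-action deep RL algorithm choosing among the learned candidate actions. Below is a minimal PyTorch sketch of that idea; the network architecture, the soft-minimum loss form, and the temperature value are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch of the core idea: learn K candidate actions per state from
# demonstrations, then let a discrete-action RL agent pick among them.
# Names, architecture, and loss details here are assumptions for illustration.
import torch
import torch.nn as nn


class CandidateActionNet(nn.Module):
    """Maps a state to K candidate continuous actions (the learned discretization)."""

    def __init__(self, state_dim: int, action_dim: int, num_candidates: int, hidden: int = 256):
        super().__init__()
        self.num_candidates = num_candidates
        self.action_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_candidates * action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns a (batch, K, action_dim) tensor of candidate actions.
        return self.net(state).view(-1, self.num_candidates, self.action_dim)


def soft_min_bc_loss(candidates: torch.Tensor, demo_action: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Soft-minimum reconstruction loss: at least one candidate should be close
    to the demonstrated action, without forcing all candidates to collapse."""
    # Squared distances between each candidate and the demonstrated action.
    dists = ((candidates - demo_action.unsqueeze(1)) ** 2).sum(dim=-1)  # (batch, K)
    weights = torch.softmax(-dists / temperature, dim=-1).detach()      # (batch, K)
    return (weights * dists).sum(dim=-1).mean()


if __name__ == "__main__":
    # Hypothetical dimensions and a placeholder demonstration batch.
    state_dim, action_dim, K = 17, 6, 10
    psi = CandidateActionNet(state_dim, action_dim, K)
    opt = torch.optim.Adam(psi.parameters(), lr=3e-4)
    demo_states = torch.randn(32, state_dim)
    demo_actions = torch.randn(32, action_dim)

    # One gradient step of the quantization network on demonstration pairs.
    loss = soft_min_bc_loss(psi(demo_states), demo_actions)
    opt.zero_grad(); loss.backward(); opt.step()

    # At interaction time, a discrete-action agent (e.g. a DQN-style learner)
    # outputs an index k in {0, ..., K-1}; the executed continuous action is
    # the k-th candidate for the current state.
    candidates = psi(demo_states[:1])        # (1, K, action_dim)
    k = torch.randint(K, (1,))               # stand-in for the discrete policy's choice
    continuous_action = candidates[0, k]     # action actually sent to the environment
```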