We investigate the effect of using human demonstration data in the replay buffer for Deep Reinforcement Learning. We use a policy gradient method with a modified experience replay buffer in which a human demonstration experience is sampled with a given probability. We analyze different ratios of demonstration data in a task where an agent attempts to reach a goal while avoiding obstacles. Our results suggest that while the agents trained by pure self-exploration and pure demonstration had similar success rates, the pure demonstration model converged faster to solutions with fewer steps.
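As a concrete illustration of the sampling scheme described above, the sketch below shows one way a mixed replay buffer could be structured: with a given probability, a transition is drawn from a fixed set of human demonstrations, and otherwise from the agent's own self-exploration experience. The class name `MixedReplayBuffer`, the parameter `demo_prob`, and the per-transition mixing are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque


class MixedReplayBuffer:
    """Replay buffer mixing self-exploration and human demonstration data.

    With probability `demo_prob`, a sample is drawn from the fixed human
    demonstration set; otherwise it comes from the agent's own experience.
    Names and structure are illustrative, not the authors' exact code.
    """

    def __init__(self, demo_transitions, demo_prob, capacity=100_000):
        self.demo = list(demo_transitions)   # fixed human demonstration data
        self.agent = deque(maxlen=capacity)  # self-exploration experience
        self.demo_prob = demo_prob           # probability of sampling a demo transition

    def add(self, transition):
        """Store a transition (s, a, r, s', done) collected by the agent."""
        self.agent.append(transition)

    def sample(self, batch_size):
        """Draw a minibatch, choosing the source per transition."""
        batch = []
        for _ in range(batch_size):
            use_demo = self.demo and (not self.agent or random.random() < self.demo_prob)
            source = self.demo if use_demo else self.agent
            batch.append(random.choice(source))
        return batch
```

Setting `demo_prob` to 0 or 1 would correspond to the pure self-exploration and pure demonstration conditions compared in the abstract, with intermediate values giving the mixed ratios.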