Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A trajectory selection module is introduced to filter uninformative samples from each episode during the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With its capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems with continuous action spaces, such as robot guidance and manipulation.
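To make the collect-relabel-filter-imitate cycle described above concrete, the following is a minimal Python sketch of one such iteration. The interfaces (`env`, `policy`, the goal-keyed step dictionaries) and the progress-based selection criterion are illustrative assumptions, not the paper's exact implementation; reward recomputation under the relabelled goal and the adaptive loss are simplified for brevity.

```python
import numpy as np

def relabel_with_hindsight(episode):
    """Replace the desired goal of every step with the goal actually achieved
    at the end of the episode (HER-style 'final' relabelling). In a full
    implementation the reward would also be recomputed under the new goal."""
    achieved_final = episode[-1]["achieved_goal"]
    return [dict(step, desired_goal=achieved_final) for step in episode]

def select_informative(episode, min_progress=1e-3):
    """Trajectory selection (assumed criterion): keep steps where the achieved
    goal still changes afterwards, dropping the uninformative remainder."""
    kept = []
    for i, step in enumerate(episode[:-1]):
        progress = np.linalg.norm(
            episode[i + 1]["achieved_goal"] - step["achieved_goal"])
        if progress > min_progress:
            kept.append(step)
    return kept

def self_imitation_update(policy, episode):
    """Imitate the relabelled, filtered episode: push the policy towards the
    actions it actually took, weighted by a clipped return-based advantage."""
    returns = np.cumsum([step["reward"] for step in episode][::-1])[::-1]
    for step, ret in zip(episode, returns):
        advantage = max(ret - policy.value(step["obs"]), 0.0)
        policy.imitate(step["obs"], step["action"], advantage)

def esil_iteration(env, policy, horizon=50):
    """One episodic self-imitation cycle: collect, relabel, filter, imitate."""
    episode, obs = [], env.reset()
    for _ in range(horizon):
        action = policy.act(obs)
        next_obs, reward, done, info = env.step(action)
        episode.append(dict(obs=obs, action=action, reward=reward,
                            achieved_goal=info["achieved_goal"],
                            desired_goal=info["desired_goal"]))
        obs = next_obs
        if done:
            break
    hindsight_episode = relabel_with_hindsight(episode)
    filtered = select_informative(hindsight_episode)
    if filtered:
        self_imitation_update(policy, filtered)
```

The selection step is what distinguishes this from plain hindsight relabelling: only the portion of the episode that carries useful progress toward the substituted goal is imitated, which is the role the trajectory selection module plays in the method.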