Exploration is critical for deep reinforcement learning in complex environments with high-dimensional observations and sparse rewards. To address this problem, recent approaches leverage intrinsic rewards to improve exploration, for example through novelty-based and prediction-based exploration. However, many intrinsic reward modules require sophisticated structures and representation learning, resulting in prohibitive computational complexity and unstable performance. In this paper, we propose Rewarding Episodic Visitation Discrepancy (REVD), a computationally efficient and quantified exploration method. More specifically, REVD provides intrinsic rewards by evaluating the Rényi divergence-based visitation discrepancy between episodes. To estimate the divergence efficiently, a k-nearest neighbor estimator is used with a randomly initialized state encoder. Finally, REVD is tested on PyBullet Robotics Environments and Atari games. Extensive experiments demonstrate that REVD significantly improves the sample efficiency of reinforcement learning algorithms and outperforms the benchmark methods.
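The abstract describes the reward only in words, so the following is a minimal sketch of how a k-nearest neighbor visitation discrepancy between two episodes might be computed, not the paper's exact estimator. Everything named here is an assumption made for illustration: the functions `make_random_encoder`, `kth_nn_distance`, and `episodic_discrepancy_rewards`, the linear-plus-tanh encoder, the distance-ratio form of the reward, and the defaults `k=3` and `alpha=0.5`.

```python
import math
import random


def make_random_encoder(obs_dim, latent_dim, seed=0):
    """Build a fixed, never-trained encoder: random linear map followed by tanh.

    Assumption: the abstract only says "randomly initialized state encoder";
    this architecture is a stand-in chosen for brevity.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(obs_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(obs_dim)]
               for _ in range(latent_dim)]

    def encode(obs):
        return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in weights]

    return encode


def kth_nn_distance(query, references, k):
    """Euclidean distance from `query` to its k-th nearest point in `references`."""
    return sorted(math.dist(query, ref) for ref in references)[k - 1]


def episodic_discrepancy_rewards(current_obs, previous_obs, encode,
                                 k=3, alpha=0.5, eps=1e-8):
    """Return one intrinsic reward per state of the current episode.

    Assumed, simplified form: the ratio of the k-NN distance to the previous
    episode over the k-NN distance within the current episode, raised to
    (1 - alpha). States far from anything visited in the previous episode
    receive a larger reward.
    """
    cur = [encode(o) for o in current_obs]
    prev = [encode(o) for o in previous_obs]
    rewards = []
    for i, e in enumerate(cur):
        others = cur[:i] + cur[i + 1:]  # leave the query state out of its own episode
        dist_to_prev = kth_nn_distance(e, prev, k)
        dist_within = kth_nn_distance(e, others, k)
        rewards.append((dist_to_prev / (dist_within + eps)) ** (1.0 - alpha))
    return rewards
```

In this sketch the encoder is never updated, so the only per-episode cost is the nearest-neighbor search. The previous episode must contain at least `k` states and the current episode at least `k + 1`, otherwise the k-th neighbor does not exist.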