Exploration in high-dimensional, continuous spaces with sparse rewards is an open problem in reinforcement learning. Artificial curiosity algorithms address this by creating rewards that lead to exploration. Given a reinforcement learning algorithm capable of maximizing rewards, the problem reduces to finding an optimization objective consistent with exploration. Maximum entropy exploration uses the entropy of the state visitation distribution as such an objective. However, efficiently estimating the entropy of the state visitation distribution is challenging in high-dimensional, continuous spaces. We introduce an artificial curiosity algorithm based on lower bounding an approximation to the entropy of the state visitation distribution. The bound relies on a result we prove for non-parametric density estimation in arbitrary dimensions using k-means. We show that our approach is both computationally efficient and competitive on benchmarks for exploration in high-dimensional, continuous spaces, especially on tasks where reinforcement learning algorithms are unable to find rewards.
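The abstract's idea of deriving curiosity rewards from a k-means density estimate can be illustrated with a minimal sketch. The function names (`lloyd_kmeans`, `curiosity_reward`) and the specific update scheme below are illustrative assumptions, not the paper's actual algorithm: the intrinsic reward here is simply the distance from a state to its nearest centroid, which is large in sparsely visited (low-density) regions and therefore encourages visits that raise the entropy of the state visitation distribution.

```python
import numpy as np

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over visited states; returns (k, dim) centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each state to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned states.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = points[mask].mean(axis=0)
    return centroids

def curiosity_reward(state, centroids):
    """Distance to the nearest centroid as an intrinsic reward.

    Under a k-means-style non-parametric density estimate, a large
    nearest-centroid distance signals a rarely visited region, so
    rewarding it pushes the policy toward higher visitation entropy.
    """
    return np.linalg.norm(centroids - state, axis=-1).min()
```

In an actual agent loop, the centroids would be refit (or updated incrementally) on a buffer of recently visited states, and `curiosity_reward` would be added to, or substituted for, the sparse environment reward.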