Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperature should be high early in training to avoid overfitting to noisy value estimates and decrease later in training as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. In this paper, we present a simple state-based temperature scheduling approach, and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results.
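To make the idea concrete, below is a minimal tabular sketch of soft Q-learning with a state-dependent temperature that decays with the state's visit count. The specific schedule `tau_init / (1 + decay * N(s))`, the hyperparameter names, and the class `CountBasedSoftQ` are illustrative assumptions, not the paper's exact CBSQL formulation.

```python
import numpy as np
from collections import defaultdict

def soft_value(q_row, tau):
    # Soft (log-sum-exp) state value: V(s) = tau * log sum_a exp(Q(s,a)/tau),
    # computed in a numerically stable way.
    z = q_row / tau
    m = z.max()
    return tau * (m + np.log(np.sum(np.exp(z - m))))

class CountBasedSoftQ:
    """Tabular soft Q-learning with a per-state temperature that is annealed
    as the state's visit count grows (hypothetical schedule for illustration)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99,
                 tau_init=1.0, tau_min=0.01, decay=0.1):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.tau_init, self.tau_min, self.decay = tau_init, tau_min, decay
        self.q = defaultdict(lambda: np.zeros(n_actions))
        self.counts = defaultdict(int)

    def temperature(self, state):
        # Assumed count-based schedule: high temperature for rarely visited
        # states, decaying toward tau_min as evidence accumulates.
        n = self.counts[state]
        return max(self.tau_min, self.tau_init / (1.0 + self.decay * n))

    def act(self, state, rng):
        # Sample from the Boltzmann (softmax) policy induced by Q and tau(s).
        tau = self.temperature(state)
        logits = self.q[state] / tau
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(self.n_actions, p=probs)

    def update(self, state, action, reward, next_state, done):
        # Soft Bellman backup using the next state's own temperature.
        self.counts[state] += 1
        tau_next = self.temperature(next_state)
        target = reward if done else (
            reward + self.gamma * soft_value(self.q[next_state], tau_next))
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

In this sketch, a frequently visited state behaves almost greedily (low temperature), while a rarely visited state keeps a near-uniform, high-entropy policy, matching the intuition that exploration should persist where value estimates are still noisy.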