Many Deep Reinforcement Learning (D-RL) algorithms rely on simple forms of exploration, such as the additive action noise commonly used in continuous control domains. Typically, the scaling factor of this action noise is chosen as a hyperparameter and kept constant during training. In this paper, we focus on action noise in off-policy deep reinforcement learning for continuous control. We analyze how the learned policy is impacted by the noise type, the noise scale, and the schedule for reducing the scaling factor. We consider the two most prominent types of action noise, Gaussian and Ornstein-Uhlenbeck noise, and perform a large-scale experimental campaign: we systematically vary the noise type and scale parameter, and measure variables of interest such as the expected return of the policy and the state-space coverage during exploration. For the latter, we propose a novel state-space coverage measure $\operatorname{X}_{\mathcal{U}\text{rel}}$ that is more robust to boundary artifacts than previously proposed measures. Larger noise scales generally increase state-space coverage. However, we find that increasing the coverage with a larger noise scale is often not beneficial. On the contrary, reducing the noise scale over the training process reduces the variance and generally improves learning performance. We conclude that the best noise type and scale are environment dependent, and, based on our observations, we derive heuristic rules to guide the choice of action noise as a starting point for further optimization.
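For concreteness, the sketch below illustrates the two action-noise types discussed in the abstract, Gaussian and Ornstein-Uhlenbeck noise, together with a linear scale-reduction schedule. It is a minimal illustration, not the authors' implementation: the class names, the default parameters (sigma, theta, dt), and the clipping range in the usage comment are assumptions for the example.

```python
import numpy as np


class GaussianActionNoise:
    """Uncorrelated Gaussian action noise: eps_t ~ N(0, sigma^2 I)."""

    def __init__(self, action_dim, sigma=0.1):
        self.action_dim = action_dim
        self.sigma = sigma

    def reset(self):
        pass  # stateless: nothing to reset between episodes

    def sample(self):
        return self.sigma * np.random.randn(self.action_dim)


class OrnsteinUhlenbeckActionNoise:
    """Temporally correlated OU noise:
    x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, I).
    Parameter defaults here are illustrative, not the paper's settings."""

    def __init__(self, action_dim, sigma=0.1, theta=0.15, dt=1e-2):
        self.mu = np.zeros(action_dim)
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.reset()

    def reset(self):
        self.x = np.zeros_like(self.mu)  # restart the process at each episode

    def sample(self):
        self.x = (self.x
                  + self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.mu.shape))
        return self.x


def linear_noise_scale(step, total_steps, sigma_start=0.3, sigma_end=0.0):
    """One possible schedule: linearly reduce the noise scale over training."""
    frac = min(step / total_steps, 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)


# Typical usage during exploration (hypothetical policy and action bounds):
#   noise = OrnsteinUhlenbeckActionNoise(action_dim=6)
#   noise.sigma = linear_noise_scale(step, total_steps)
#   action = np.clip(policy(state) + noise.sample(), -1.0, 1.0)
```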