Reinforcement learning (RL) for continuous control typically employs distributions whose support covers the entire action space. In this work, we investigate the colloquially known phenomenon that trained agents often prefer actions at the boundaries of that space. We draw theoretical connections to the emergence of bang-bang behavior in optimal control, and provide extensive empirical evaluation across a variety of recent RL algorithms. We replace the standard Gaussian distribution with a Bernoulli distribution that solely considers the extremes along each action dimension - a bang-bang controller. Surprisingly, this achieves state-of-the-art performance on several continuous control benchmarks - in contrast to robotic hardware, where energy and maintenance costs affect controller choices. Since exploration, learning, and the final solution are entangled in RL, we provide additional imitation learning experiments to reduce the impact of exploration on our analysis. Finally, we show that our observations generalize to environments that aim to model real-world challenges and evaluate factors to mitigate the emergence of bang-bang solutions. Our findings emphasize challenges for benchmarking continuous control algorithms, particularly in light of potential real-world applications.
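To make the core idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of a bang-bang policy head: instead of parameterizing a Gaussian over continuous actions, the network outputs one Bernoulli logit per action dimension, and the sampled bit is mapped to the low or high extreme of that dimension. All names, network sizes, and dimensions here are hypothetical.

```python
import torch
import torch.nn as nn


class BangBangPolicy(nn.Module):
    """Sketch of a policy that samples only the extremes of each action dimension."""

    def __init__(self, obs_dim, act_dim, act_low, act_high, hidden=64):
        super().__init__()
        # One logit per action dimension: probability of choosing the "high" extreme.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.register_buffer("act_low", torch.as_tensor(act_low, dtype=torch.float32))
        self.register_buffer("act_high", torch.as_tensor(act_high, dtype=torch.float32))

    def forward(self, obs):
        logits = self.net(obs)
        dist = torch.distributions.Bernoulli(logits=logits)
        bits = dist.sample()                                            # 0 or 1 per dimension
        action = self.act_low + bits * (self.act_high - self.act_low)  # map bit to extreme
        log_prob = dist.log_prob(bits).sum(-1)                          # joint log-prob for policy-gradient updates
        return action, log_prob


# Usage: hypothetical 3-dimensional action space bounded in [-1, 1]^3.
policy = BangBangPolicy(obs_dim=8, act_dim=3, act_low=[-1.0] * 3, act_high=[1.0] * 3)
action, log_prob = policy(torch.randn(1, 8))
```

Such a head can, in principle, be dropped into standard policy-gradient or actor-critic pipelines in place of a Gaussian head, since it still exposes sampled actions and their log-probabilities.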