Reinforcement learning is a promising paradigm for learning robot control, allowing complex control policies to be learned without requiring a dynamics model. However, even state-of-the-art algorithms can be difficult to tune for optimal performance. We propose employing an ensemble of multiple reinforcement learning agents, each with a different set of hyperparameters, along with a mechanism for choosing the best-performing set(s) online. In the literature, ensemble techniques are used to improve performance in general, whereas the present work specifically addresses reducing the hyperparameter tuning effort. Furthermore, our approach targets online learning on a single robotic system and does not require running multiple simulators in parallel. Although the idea is generic, Deep Deterministic Policy Gradient (DDPG) was chosen as the base algorithm, being a representative deep actor-critic method with good performance in continuous action settings but known high variance. We compare our Online Weighted Q-Ensemble approach to the Q-average ensemble strategies addressed in the literature, using both alternating policy training and online training, and demonstrate the advantage of the new approach in eliminating hyperparameter tuning. Applicability to real-world systems was validated in common robotic benchmark environments: the half-cheetah bipedal robot and the swimmer. Online Weighted Q-Ensemble presented overall lower variance and superior results when compared with Q-average ensembles using randomized parameterizations.
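To make the ensemble idea concrete, the sketch below shows one plausible way a weighted Q-ensemble could combine several agents: each member contributes an actor and a critic, candidate actions are scored by a weighted sum of the critics, and member weights are refreshed online from recent returns. This is a minimal illustration under assumed names and a softmax weighting rule, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's implementation) of weighted
# Q-ensemble action selection with online weight updates.
import numpy as np

class WeightedQEnsemble:
    def __init__(self, actors, critics, temperature=1.0):
        self.actors = actors           # list of callables: state -> action
        self.critics = critics         # list of callables: (state, action) -> Q-value
        self.temperature = temperature
        self.scores = np.zeros(len(actors))              # running performance per member
        self.weights = np.full(len(actors), 1.0 / len(actors))

    def act(self, state):
        """Pick the candidate action with the highest weighted ensemble Q-value."""
        candidates = [actor(state) for actor in self.actors]
        ensemble_q = [
            sum(w * critic(state, a) for w, critic in zip(self.weights, self.critics))
            for a in candidates
        ]
        return candidates[int(np.argmax(ensemble_q))]

    def update_weights(self, episode_returns, smoothing=0.9):
        """Re-weight members online from their recent episode returns (softmax over scores)."""
        returns = np.asarray(episode_returns, dtype=float)
        self.scores = smoothing * self.scores + (1.0 - smoothing) * returns
        z = (self.scores - self.scores.max()) / self.temperature
        exp_z = np.exp(z)
        self.weights = exp_z / exp_z.sum()
```

In this hypothetical scheme, a Q-average ensemble corresponds to holding the weights fixed and uniform, whereas the online weighting lets poorly parameterized members be down-weighted as learning proceeds, which is the mechanism the abstract credits for reducing hyperparameter sensitivity.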