While there has been substantial success in applying actor-critic methods to continuous control, simpler critic-only methods such as Q-learning often remain intractable in the associated high-dimensional action spaces. However, most actor-critic methods come at the cost of added complexity: heuristics for stabilization, increased compute requirements, and wider hyperparameter search spaces. We show that these issues can be largely alleviated via Q-learning by combining action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL). With bang-bang actions, the performance of this critic-only approach matches that of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a wide variety of continuous control tasks.
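To make the combination of action discretization and value decomposition concrete, the following is a minimal sketch (not the paper's implementation) of a decoupled critic: each action dimension is treated as a cooperating "agent" choosing a bang-bang action in {-1, +1}, per-dimension utilities share a state encoding, and the joint Q-value is taken as their mean so that greedy action selection factorizes into independent per-dimension argmax operations. All class names, network sizes, and the mean aggregation are illustrative assumptions.

```python
# Minimal sketch of a decoupled critic with bang-bang action discretization.
# Assumptions: per-dimension utility heads over 2 bins, aggregated by a mean;
# architecture and naming are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class DecoupledBangBangCritic(nn.Module):
    def __init__(self, obs_dim: int, action_dims: int, hidden: int = 256, bins: int = 2):
        super().__init__()
        self.action_dims = action_dims
        self.bins = bins  # 2 bins -> bang-bang actions {-1, +1}
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One utility head per action dimension, each over `bins` discrete choices.
        self.head = nn.Linear(hidden, action_dims * bins)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-dimension utilities of shape (batch, action_dims, bins).
        utilities = self.head(self.trunk(obs))
        return utilities.view(-1, self.action_dims, self.bins)

    def joint_value(self, obs: torch.Tensor, action_idx: torch.Tensor) -> torch.Tensor:
        # Joint Q estimate: mean of the utilities selected by each dimension's bin index.
        utilities = self.forward(obs)                                         # (B, M, bins)
        chosen = utilities.gather(-1, action_idx.unsqueeze(-1)).squeeze(-1)   # (B, M)
        return chosen.mean(dim=-1)                                            # (B,)

    def greedy_action(self, obs: torch.Tensor) -> torch.Tensor:
        # Greedy coordination: argmax independently per dimension, then map
        # bin indices {0, 1} to bang-bang controls {-1, +1}.
        idx = self.forward(obs).argmax(dim=-1)                                # (B, M)
        return idx.float() * 2.0 - 1.0


if __name__ == "__main__":
    critic = DecoupledBangBangCritic(obs_dim=17, action_dims=6)
    obs = torch.randn(4, 17)
    print(critic.greedy_action(obs).shape)  # torch.Size([4, 6])
```

Because the joint value decomposes over action dimensions, the maximization inside a standard Q-learning target stays tractable even as the number of action dimensions grows, which is what lets a critic-only method operate in these high-dimensional action spaces.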