In reinforcement learning, off-policy actor-critic methods like DDPG and TD3 use deterministic policy gradients: the Q-function is learned from environment data, while the actor maximizes it via gradient ascent. We observe that in complex tasks such as dexterous manipulation and locomotion under mobility constraints, the Q-function exhibits many local optima, making gradient ascent prone to getting stuck. To address this, we introduce SAVO, an actor architecture that (i) generates multiple action proposals and selects the one with the highest Q-value, and (ii) successively approximates the Q-function by truncating poor local optima to guide gradient ascent more effectively. We evaluate on tasks spanning restricted locomotion, dexterous manipulation, and recommender systems with large discrete action spaces, and show that our actor finds optimal actions more frequently and outperforms alternative actor architectures.
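As a minimal sketch of component (i), the snippet below shows how an actor with several proposal heads could score each candidate action with the learned critic and execute the highest-valued one. The class name, the number of heads, and the critic signature Q(s, a) are illustrative assumptions, not the paper's exact implementation, and the surrogate truncation of component (ii) is omitted.

```python
import torch
import torch.nn as nn

class MultiProposalActor(nn.Module):
    """Hypothetical sketch: several actor heads each propose an action;
    the critic Q(s, a) scores every proposal and the best one is returned."""

    def __init__(self, obs_dim, act_dim, num_proposals=3, hidden=256):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, act_dim), nn.Tanh(),
            )
            for _ in range(num_proposals)
        ])

    def forward(self, obs, critic):
        # Each head proposes one action per observation in the batch.
        proposals = torch.stack([h(obs) for h in self.heads], dim=1)  # (B, K, A)
        B, K, A = proposals.shape
        # Score every proposal with the learned Q-function (assumed critic(s, a)).
        q_values = critic(
            obs.unsqueeze(1).expand(B, K, -1).reshape(B * K, -1),
            proposals.reshape(B * K, A),
        ).view(B, K)
        # Select the proposal with the highest Q-value for each state.
        best = q_values.argmax(dim=1)
        return proposals[torch.arange(B), best]
```

In a DDPG/TD3-style agent, this selection would replace the single deterministic action at both exploration and target-computation time; the multiple proposals give gradient ascent several starting points, reducing the chance that every candidate sits in the same poor local optimum.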