Adversarial Imitation Learning (AIL) is a popular class of state-of-the-art Imitation Learning algorithms in which an artificial adversary's misclassification is used as a reward signal that is optimized by any standard Reinforcement Learning (RL) algorithm. Unlike most RL settings, the reward in AIL is differentiable, but model-free RL algorithms do not make use of this property to train a policy. In contrast, we leverage the differentiability of the AIL reward function and formulate a class of Actor Residual Critic (ARC) RL algorithms that parallel the standard Actor-Critic (AC) algorithms in the RL literature and use a residual critic, the $C$ function (instead of the standard $Q$ function), to approximate only the discounted future return, excluding the immediate reward. ARC algorithms have convergence properties similar to those of standard AC algorithms, with the additional advantage that the gradient through the immediate reward is exact. For the discrete (tabular) case with finite states, finite actions, and known dynamics, we prove that policy iteration with the $C$ function converges to an optimal policy. For the continuous case with function approximation and unknown dynamics, we show experimentally that ARC-aided AIL outperforms standard AIL on simulated continuous-control and real robotic manipulation tasks. ARC algorithms are simple to implement and can be incorporated into any existing AIL implementation that uses an AC algorithm.
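As a minimal illustrative sketch of the decomposition described above (using standard RL notation $r$, $\gamma$, $\mathcal{P}$, and $V^{\pi}$, which are assumptions not defined in this abstract), the residual critic can be written as
\[
Q^{\pi}(s,a) \;=\; \underbrace{r(s,a)}_{\text{immediate reward}} \;+\; \underbrace{\gamma \, \mathbb{E}_{s' \sim \mathcal{P}(\cdot \mid s,a)}\!\left[ V^{\pi}(s') \right]}_{C^{\pi}(s,a)},
\]
so that an actor $\pi_{\theta}$ can be updated with the gradient
\[
\nabla_{\theta}\, \mathbb{E}_{s}\!\left[ r\big(s, \pi_{\theta}(s)\big) + C\big(s, \pi_{\theta}(s)\big) \right],
\]
where the gradient through $r$ is computed exactly via the differentiable AIL reward (e.g., a discriminator), and only $C$ is approximated by a learned critic.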