The goal of reinforcement learning (RL) is to maximize the expected cumulative return. It has been shown that this objective can be represented as an optimization problem over the state-action visitation distribution under linear constraints. The dual problem of this formulation, which we refer to as dual RL, is unconstrained and easier to optimize. We show that several state-of-the-art off-policy deep RL algorithms, across both online and offline RL and imitation learning (IL) settings, can be viewed as dual RL approaches in a unified framework. This unification provides a common ground to study and identify the components that contribute to the success of these methods, and it also reveals shortcomings shared across methods, along with new insights for improvement. Our analysis shows that prior off-policy imitation learning methods rest on an unrealistic coverage assumption and minimize a particular f-divergence between the visitation distributions of the learned policy and the expert policy. We propose a new method, based on a simple modification to the dual RL framework, that enables imitation learning from arbitrary off-policy data and attains near-expert performance without learning a discriminator. Further, by framing the recent state-of-the-art offline RL method XQL in the dual RL framework, we propose alternatives to its Gumbel regression loss that achieve improved performance and resolve the training instability of XQL. Project code and details can be found at https://hari-sikchi.github.io/dual-rl.
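To make the primal-dual relationship referenced above concrete, the sketch below shows a generic form of the visitation-distribution linear program and its Lagrangian dual; the f-divergence regularizer, temperature alpha, and the symbols d, d^O, V, and f^* are illustrative assumptions for exposition, not necessarily the paper's exact formulation or notation.

% A hedged sketch, assuming an f-divergence-regularized visitation LP.
% Symbols d (visitation distribution), d^O (off-policy data distribution),
% V (Lagrange multipliers), f and its convex conjugate f^*, and the
% temperature \alpha are illustrative, not the paper's exact notation.
\begin{align*}
% Primal: maximize return over visitation distributions, regularized toward
% an offline distribution, subject to the linear Bellman flow constraints.
\max_{d \ge 0}\;& \mathbb{E}_{(s,a) \sim d}\big[r(s,a)\big] - \alpha\, D_f\!\left(d \,\Vert\, d^{O}\right) \\
\text{s.t.}\;& \sum_{a} d(s,a) = (1-\gamma)\, d_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s. \\[4pt]
% Dual: introducing multipliers V(s) for the flow constraints and applying
% convex conjugacy yields an unconstrained problem over V alone.
\min_{V}\;& (1-\gamma)\, \mathbb{E}_{s \sim d_0}\big[V(s)\big]
  + \alpha\, \mathbb{E}_{(s,a) \sim d^{O}}\!\left[ f^{*}\!\left( \frac{r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')] - V(s)}{\alpha} \right) \right].
\end{align*}

Under these assumptions, different choices of f (and of how the conjugate term is estimated from data) would yield different members of the dual family, which is roughly the sense in which the abstract speaks of a unified framework.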