Offline reinforcement learning learns an effective policy from offline datasets without online interaction, and it has attracted persistent research attention due to its potential for practical application. However, extrapolation error caused by distribution shift still leads to overestimation of actions that transition to out-of-distribution (OOD) states, which degrades the reliability and robustness of the offline policy. In this paper, we propose Contextual Conservative Q-Learning (C-CQL) to learn a robustly reliable policy through contextual information captured via an inverse dynamics model. Under the supervision of the inverse dynamics model, C-CQL tends to learn a policy that generates stable transitions at perturbed states, since perturbed states are a common kind of OOD state. In this manner, the learned policy is more likely to generate transitions that land in the empirical next-state distribution of the offline dataset, i.e., robustly reliable transitions. Moreover, we theoretically show that C-CQL generalizes both Conservative Q-Learning (CQL) and aggressive State Deviation Correction (SDC). Finally, experimental results demonstrate that C-CQL achieves state-of-the-art performance in most environments of the offline MuJoCo suite and in a noisy MuJoCo setting.
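To make the role of the inverse dynamics model concrete, the following is a minimal sketch, assuming a PyTorch setup in which the inverse dynamics model predicts the action linking a state to its observed next state, and a regularizer pulls the policy's action at a Gaussian-perturbed state toward that prediction. The class and function names, the perturbation scheme, and the squared-error form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Hypothetical inverse dynamics model: predicts the action that produced
    the transition (s, s'). Architecture and sizes are illustrative only."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state, next_state):
        return self.net(torch.cat([state, next_state], dim=-1))


def contextual_regularizer(policy, inv_model, state, next_state, noise_std=0.01):
    """Illustrative regularizer (an assumption, not the paper's exact loss):
    at a perturbed, OOD-like state, encourage the policy to choose the action
    that the inverse dynamics model predicts would still reach the empirical
    next state from the offline dataset."""
    perturbed_state = state + noise_std * torch.randn_like(state)
    with torch.no_grad():
        target_action = inv_model(perturbed_state, next_state)
    policy_action = policy(perturbed_state)
    return ((policy_action - target_action) ** 2).mean()
```

In this sketch, the regularizer would be added to a conservative Q-learning objective, so that the policy is penalized both for overestimated OOD actions and for actions that drift away from the dataset's next-state distribution at perturbed states.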