Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Most offline reinforcement learning (RL) methods face a trade-off between improving the policy enough to surpass the behavior policy and constraining it to limit deviation from the behavior policy, because computing $Q$-values with out-of-distribution (OOD) actions incurs errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by expectile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a completely in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on the D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.
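
To make the in-sample idea concrete, the following is a minimal sketch of the asymmetric squared (expectile) loss that IQL is usually described as using. It is not the SQL or EQL objective proposed here, and it is a toy scalar fit, not a full algorithm. The names `expectile_loss`, `fit_expectile`, and `q_in_sample` are hypothetical, and the $Q$-values are made-up numbers standing in for the values of actions that actually appear in the dataset for one state.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(q_values, tau, lr=0.1, steps=2000):
    """Fit a scalar V by gradient descent on the mean expectile loss.

    Only the Q-values of dataset (in-sample) actions are used; no
    value is ever queried for an action outside the dataset.
    """
    v = sum(q_values) / len(q_values)  # start from the in-sample mean
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            diff = q - v
            weight = tau if diff >= 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of expectile_loss(q - v, tau)
        v -= lr * grad / len(q_values)
    return v


# Hypothetical Q-values of the actions observed in the data for one state.
q_in_sample = [1.0, 2.0, 5.0]

v_mean = fit_expectile(q_in_sample, tau=0.5)   # symmetric loss: recovers the mean
v_high = fit_expectile(q_in_sample, tau=0.99)  # large tau: close to the in-sample max

print(v_mean, v_high)
```

With `tau = 0.5` the loss is symmetric and the fit is the mean of the in-sample values. As `tau` approaches 1 the fit moves toward the largest in-sample value without ever exceeding it, which is how a value estimate can improve on the behavior policy while never evaluating an unseen action.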
