The remarkable success of reinforcement learning (RL) relies heavily on observing the reward of every visited state-action pair. In many real-world applications, however, an agent observes only a score that reflects the quality of the whole trajectory, referred to as the {\em trajectory-wise reward}. In such a setting, it is difficult for standard RL methods to utilize the trajectory-wise reward well, and policy evaluation can incur large bias and variance errors. In this work, we propose a novel offline RL algorithm, called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED), which decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution and then performs pessimistic value iteration based on the learned proxy rewards. To ensure that the value functions constructed by PARTED are always pessimistic with respect to the optimal ones, we design a new penalty term that offsets the uncertainty of the proxy reward. For general episodic MDPs with a large state space, we show that PARTED with overparameterized neural network function approximation achieves $\tilde{\mathcal{O}}(D_{\text{eff}}H^2/\sqrt{N})$ suboptimality, where $H$ is the episode length, $N$ is the total number of samples, and $D_{\text{eff}}$ is the effective dimension of the neural tangent kernel matrix. To further illustrate this result, we show that PARTED achieves $\tilde{\mathcal{O}}(dH^3/\sqrt{N})$ suboptimality for linear MDPs, where $d$ is the feature dimension; this matches the neural network bound when $D_{\text{eff}}=dH$. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient for general MDPs with trajectory-wise reward.
Title: Provably Efficient Offline Reinforcement Learning with Trajectory-wise Reward
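To make the two main steps of the abstract concrete, below is a minimal sketch (not the authors' implementation) of (i) least-squares reward redistribution with a linear proxy-reward model, fitting per-step rewards so that their sum matches the observed trajectory-wise return, and (ii) a pessimism penalty built from the resulting feature covariance. The feature map `phi`, the regularizer `lam`, the confidence width `beta`, and the function names are illustrative assumptions.

```python
import numpy as np

def fit_proxy_reward(trajectories, phi, d, lam=1.0):
    """Least-squares reward redistribution (sketch).

    Each trajectory is (list of (state, action) pairs, trajectory_return).
    Fit theta so that sum_h phi(s_h, a_h)^T theta matches the observed
    trajectory-wise return, via ridge regression on summed features.
    """
    X = np.stack([sum(phi(s, a) for s, a in traj) for traj, _ in trajectories])
    y = np.array([ret for _, ret in trajectories])
    Lam = lam * np.eye(d) + X.T @ X          # regularized Gram matrix
    theta = np.linalg.solve(Lam, X.T @ y)    # ridge solution
    return theta, Lam

def proxy_reward_with_penalty(s, a, theta, Lam, phi, beta=1.0):
    """Pessimistic per-step proxy reward: the learned reward minus an
    uncertainty bonus beta * sqrt(phi^T Lam^{-1} phi) (sketch)."""
    f = phi(s, a)
    bonus = beta * np.sqrt(f @ np.linalg.solve(Lam, f))
    return f @ theta - bonus
```

The paper's analysis covers overparameterized neural (NTK-based) function approximation; the elliptical bonus above is the standard linear-MDP analogue and is shown here only to illustrate how uncertainty in the redistributed reward can be offset before running pessimistic value iteration.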