Offline reinforcement learning (RL) enables effective learning from previously collected data without exploration, which is promising for real-world applications where exploration is expensive or even infeasible. The discount factor, $\gamma$, plays a vital role in improving the sample efficiency and estimation accuracy of online RL, but its role in offline RL is not well explored. This paper examines two distinct effects of $\gamma$ in offline RL through theoretical analysis, namely a regularization effect and a pessimism effect. On the one hand, $\gamma$ acts as a regularizer that trades off optimality against sample efficiency on top of existing offline techniques. On the other hand, a lower guidance discount $\gamma$ can also be viewed as a form of pessimism, in which we optimize the policy's performance under the worst possible models. We empirically verify these theoretical observations on tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both in small-data regimes on top of existing offline methods and in large-data regimes without other forms of conservatism.
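The intuition that a lower discount shortens the effective planning horizon and yields more conservative value estimates can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's algorithm: value iteration on a small randomly generated tabular MDP, run with the nominal discount and with a lower guidance discount; all names and the example MDP are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters=1000):
    """Tabular value iteration. P: (A, S, S) transition tensor, R: (S, A) rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V = Q.max(axis=1)
    return V

# Illustrative random MDP (not from the paper).
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # rows sum to 1
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V_hi = value_iteration(P, R, gamma=0.99)  # nominal evaluation discount
V_lo = value_iteration(P, R, gamma=0.90)  # lower "guidance" discount

# Scaling by (1 - gamma) puts both estimates on a comparable per-step scale;
# the lower-gamma values weight long-horizon returns less, which acts as a
# conservative regularizer when long-horizon model/data estimates are unreliable.
print((1 - 0.99) * V_hi)
print((1 - 0.90) * V_lo)
```

In this sketch the lower guidance discount simply down-weights distant future returns; the paper's analysis interprets this shrinkage as both a regularizer and an implicit form of pessimism in offline RL.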