We study the finite-horizon offline reinforcement learning (RL) problem. Because actions at any state can affect next-state distributions, the resulting distributional-shift challenges can make this problem far more statistically complex than offline policy learning for a finite sequence of stochastic contextual bandit environments. We formalize this insight by showing that the statistical hardness of an offline RL instance can be measured by estimating how strongly actions affect next-state distributions. Furthermore, this estimated impact lets us propagate just enough value-function uncertainty from future steps to avoid model exploitation, enabling algorithms that improve upon traditional pessimistic approaches to offline RL on statistically simple instances. Our approach is supported by theory and simulations.
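To make the abstract's idea concrete, here is a minimal sketch, not the paper's actual algorithm, of pessimistic backward induction on a tabular finite-horizon MDP in which the uncertainty propagated from step h+1 is scaled by an estimated "action impact" on next-state distributions. All names and choices (P_hat, R_hat, N, the 1/sqrt(N) bonus schedule, the total-variation-based impact measure) are illustrative assumptions, not quantities defined in the paper.

```python
# Illustrative sketch only: adaptive pessimistic value iteration where the
# amount of future-value uncertainty carried backward is scaled by how much
# actions move the empirical next-state distribution at each step.
import numpy as np


def action_impact(P_h):
    """Largest total-variation gap between the next-state distributions of any
    two actions at the same state. Near 0 means the step is bandit-like."""
    S, A, _ = P_h.shape
    gap = 0.0
    for s in range(S):
        for a in range(A):
            for b in range(a + 1, A):
                gap = max(gap, 0.5 * np.abs(P_h[s, a] - P_h[s, b]).sum())
    return gap


def adaptive_pessimistic_vi(P_hat, R_hat, N, H, c=1.0):
    """P_hat: (H, S, A, S) empirical transitions, R_hat: (H, S, A) empirical
    rewards, N: (H, S, A) offline visit counts. Returns a greedy policy and a
    pessimistic value estimate at step 0."""
    S, A = R_hat.shape[1], R_hat.shape[2]
    V_mean = np.zeros((H + 1, S))   # point estimate of the value
    W = np.zeros((H + 1, S))        # accumulated uncertainty carried backward
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        impact = action_impact(P_hat[h])                # in [0, 1]
        bonus = c * np.sqrt(1.0 / np.maximum(N[h], 1))  # local (bandit-level) uncertainty
        Q_mean = R_hat[h] + P_hat[h] @ V_mean[h + 1]    # standard backup
        # Propagate future uncertainty only in proportion to how much actions
        # actually change the next-state distribution at this step.
        Q_unc = bonus + impact * (P_hat[h] @ W[h + 1])
        Q_pess = np.clip(Q_mean - Q_unc, 0.0, H - h)
        pi[h] = Q_pess.argmax(axis=1)
        idx = np.arange(S)
        V_mean[h] = Q_mean[idx, pi[h]]
        W[h] = Q_unc[idx, pi[h]]
    return pi, np.clip(V_mean[0] - W[0], 0.0, H)
```

When the estimated impact is close to zero at every step, the penalty reduces to the per-step bandit-level bonus, matching the abstract's claim that such instances are no harder than a sequence of contextual bandit problems; when the impact is large, the sketch recovers the usual horizon-compounded pessimism.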