Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Depending on the composition of the offline dataset, two main categories of methods are used: imitation learning, which is suitable for expert datasets, and vanilla offline RL, which often requires datasets with uniform coverage. From a practical standpoint, datasets often deviate from these two extremes, and the exact data composition is usually unknown a priori. To bridge this gap, we present a new offline RL framework that smoothly interpolates between the two extremes of data composition, hence unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation from the behavior policy to the expert policy alone. Under this new framework, we further investigate the question of algorithm design: can one develop an algorithm that achieves a minimax optimal rate and also adapts to unknown data composition? To address this question, we consider a lower confidence bound (LCB) algorithm developed based on pessimism in the face of uncertainty in offline RL. We study finite-sample properties of LCB as well as information-theoretic limits in multi-armed bandits, contextual bandits, and Markov decision processes (MDPs). Our analysis reveals surprising facts about optimality rates. In particular, in all three settings, LCB achieves a faster rate of $1/N$ for nearly-expert datasets compared to the usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in the batch dataset. In the case of contextual bandits with at least two contexts, we prove that LCB is adaptively optimal for the entire data composition range, achieving a smooth transition from imitation learning to offline RL. We further show that LCB is almost adaptively optimal in MDPs.
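
As an illustration of pessimism in the simplest of the three settings, the following is a minimal sketch of an LCB rule for an offline multi-armed bandit. It assumes rewards in $[0, 1]$ and a Hoeffding-style penalty; the function name `lcb_best_arm`, the confidence parameter `delta`, and the constants in the penalty are assumptions made for this sketch and are not taken from the paper.

```python
import math


def lcb_best_arm(dataset, num_arms, delta=0.1):
    """Return (chosen_arm, lcbs) for an offline bandit dataset.

    dataset: iterable of (arm, reward) pairs, rewards assumed in [0, 1].
    The penalty is an illustrative Hoeffding-style bound, not the
    paper's exact definition.
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for arm, reward in dataset:
        counts[arm] += 1
        sums[arm] += reward

    lcbs = []
    for arm in range(num_arms):
        if counts[arm] == 0:
            # An arm that never appears in the data gets the most
            # pessimistic value possible.
            lcbs.append(-math.inf)
            continue
        mean = sums[arm] / counts[arm]
        penalty = math.sqrt(math.log(2 * num_arms / delta) / (2 * counts[arm]))
        lcbs.append(mean - penalty)

    # Pessimism: choose the arm whose lower bound is largest.
    chosen = max(range(num_arms), key=lambda a: lcbs[a])
    return chosen, lcbs


if __name__ == "__main__":
    # Arm 0: 100 samples with empirical mean 0.6.
    # Arm 1: a single lucky sample with reward 1.0.
    # Arm 2: never observed.
    data = [(0, 1.0)] * 60 + [(0, 0.0)] * 40 + [(1, 1.0)]
    arm, lcbs = lcb_best_arm(data, num_arms=3)
    print(arm, lcbs)
```

On this toy dataset, picking the arm with the highest empirical mean would select arm 1 on the strength of one sample, whereas the LCB rule selects arm 0 because the penalty for arm 1 is large. Penalizing poorly covered actions in this way is what lets the rule lean on whatever the behavior policy sampled often.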
