Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
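To make the multiple-data-split idea concrete, the following is a minimal sketch (not the authors' implementation) of algorithm-hyperparameter selection for offline RL via repeated train/validation splits of the logged dataset. The candidate list, the `train_policy` trainer, and the `ope_estimate` off-policy evaluation routine are hypothetical stand-ins; a real pipeline would plug in actual offline RL algorithms and OPE estimators.

```python
"""Sketch: select an algorithm-hyperparameter pair using repeated data splits,
then retrain the winner on the full dataset before deployment."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged dataset of (state, action, reward, next_state) transitions.
dataset = [(rng.normal(size=4), int(rng.integers(3)), float(rng.normal()), rng.normal(size=4))
           for _ in range(500)]

def train_policy(algo_hp, transitions):
    """Stand-in for an offline RL algorithm run with fixed hyperparameters."""
    # A real implementation would fit, e.g., a batch-constrained or conservative Q-learning policy here.
    return {"algo_hp": algo_hp, "n_train": len(transitions)}

def ope_estimate(policy, transitions):
    """Stand-in for an off-policy value estimate (e.g., FQE or importance sampling) on held-out data."""
    return float(np.mean([r for _, _, r, _ in transitions]))

# Hypothetical candidate algorithm-hyperparameter pairs to compare.
candidates = [("algoA", {"lr": 1e-3}), ("algoA", {"lr": 1e-4}), ("algoB", {"tau": 0.5})]

n_splits, holdout_frac = 10, 0.2
scores = {i: [] for i in range(len(candidates))}

for _ in range(n_splits):
    idx = rng.permutation(len(dataset))
    cut = int(len(dataset) * (1 - holdout_frac))
    train = [dataset[i] for i in idx[:cut]]
    valid = [dataset[i] for i in idx[cut:]]
    for i, cand in enumerate(candidates):
        policy = train_policy(cand, train)             # train on this split's training portion
        scores[i].append(ope_estimate(policy, valid))  # score on this split's held-out portion

# Average validation estimates across splits to reduce sensitivity to any single split,
# pick the best candidate, and retrain it on all available data for deployment.
best = max(scores, key=lambda i: np.mean(scores[i]))
final_policy = train_policy(candidates[best], dataset)
print("selected candidate:", candidates[best])
```

Averaging the held-out estimates over several random splits, rather than relying on a single split, is what the abstract argues makes selection more reliable when the dataset is small.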