In many practical applications of RL, it is expensive to observe state transitions from the environment. For example, in the problem of plasma control for nuclear fusion, computing the next state for a given state-action pair requires querying an expensive transition function, which can cost many hours of computer simulation or dollars of scientific research. Such expensive data collection prohibits the application of standard RL algorithms, which usually require a large number of observations to learn. In this work, we address the problem of efficiently learning a policy while making a minimal number of state-action queries to the transition function. In particular, we leverage ideas from Bayesian optimal experimental design to guide the selection of state-action queries for efficient learning. We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process. At each iteration, our algorithm maximizes this acquisition function to choose the most informative state-action pair to be queried, thus yielding a data-efficient RL approach. We experiment with a variety of simulated continuous control problems and show that our approach learns an optimal policy with up to $5$ -- $1,000\times$ less data than model-based RL baselines and $10^3$ -- $10^5\times$ less data than model-free RL baselines. We also provide several ablated comparisons which point to substantial improvements arising from the principled method of obtaining data.
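To make the query-selection loop concrete, the following is a minimal illustrative sketch, not the paper's implementation: it learns unknown one-dimensional dynamics with a Bayesian linear model and, at each iteration, queries the state-action pair that maximizes a simple acquisition proxy. The simulator `toy_transition`, the feature map, and the variance-based acquisition are assumptions for illustration; the acquisition function proposed in the paper measures information about the optimal MDP solution rather than raw predictive variance of the dynamics.

```python
# Hedged sketch of an acquisition-driven query loop (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def toy_transition(s, a, noise=0.05):
    """Hypothetical expensive simulator: returns the next state for (s, a)."""
    return 0.9 * s + 0.5 * a + noise * rng.standard_normal()

def features(s, a):
    """Feature map for a Bayesian linear model of the dynamics."""
    return np.array([s, a, 1.0])

# Posterior over dynamics weights, initialized from an isotropic prior.
noise_var = 0.05 ** 2
Sigma_inv = np.eye(3)          # posterior precision
b = np.zeros(3)                # accumulated Phi^T y / noise_var

# Candidate state-action queries (a grid over the state-action space).
candidates = [(s, a) for s in np.linspace(-1, 1, 11)
                     for a in np.linspace(-1, 1, 11)]

for t in range(20):
    # Acquisition proxy: posterior predictive variance at each candidate.
    Sigma = np.linalg.inv(Sigma_inv)
    acq = [features(s, a) @ Sigma @ features(s, a) for (s, a) in candidates]
    s_q, a_q = candidates[int(np.argmax(acq))]

    # Query the expensive transition function only at the chosen pair.
    s_next = toy_transition(s_q, a_q)

    # Bayesian update of the dynamics posterior with the new observation.
    phi = features(s_q, a_q)
    Sigma_inv += np.outer(phi, phi) / noise_var
    b += phi * s_next / noise_var

print("posterior mean dynamics weights:", np.linalg.inv(Sigma_inv) @ b)
```

In this toy version, each query is spent where the model is most uncertain about the dynamics; the paper's acquisition function instead directs queries toward transitions that are most informative about the optimal policy, which is what yields the reported data-efficiency gains.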