Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.
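To make the index family concrete, the following is a minimal sketch, not the paper's exact construction, of a confidence-adjusted index rule for the finite-armed batch setting. It assumes rewards in [0, 1] and a Hoeffding-style confidence width; the function name, the delta parameter, and the handling of unobserved arms are illustrative assumptions.

```python
import numpy as np

def confidence_adjusted_index(data, num_arms, beta, delta=0.1):
    """Pick an arm from logged bandit data via a confidence-adjusted index.

    data:  iterable of (arm, reward) pairs from an arbitrary logging policy.
    beta:  confidence adjustment; beta > 0 acts optimistically (UCB-style),
           beta < 0 pessimistically (LCB-style), beta = 0 is the greedy choice.
    delta: nominal failure probability for the Hoeffding-style width (assumed form).
    """
    counts = np.zeros(num_arms)
    sums = np.zeros(num_arms)
    for arm, reward in data:
        counts[arm] += 1
        sums[arm] += reward

    observed = counts > 0
    means = np.where(observed, sums / np.maximum(counts, 1), 0.0)

    # Hoeffding-style width for rewards in [0, 1]; unobserved arms get an
    # effectively unbounded width, so the sign of beta decides whether they
    # look attractive (optimism) or are ruled out (pessimism).
    width = np.where(
        observed,
        np.sqrt(np.log(2 * num_arms / delta) / (2 * np.maximum(counts, 1))),
        1e6,
    )

    return int(np.argmax(means + beta * width))

# Example usage on a tiny hypothetical log: the pessimistic choice (beta < 0)
# avoids the never-observed arm 2, while the optimistic one (beta > 0) prefers it.
data = [(0, 1.0), (0, 0.0), (1, 1.0)]
print(confidence_adjusted_index(data, num_arms=3, beta=-1.0))  # pessimistic
print(confidence_adjusted_index(data, num_arms=3, beta=+1.0))  # optimistic
```

A single scalar beta thus interpolates between the optimistic, neutral, and pessimistic members of the family discussed above; the abstract's minimax result applies to any fixed choice of this adjustment.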