The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) often cannot handle large context spaces. Essentially all existing well-performing algorithms for general contextual bandit problems rely on weighted action-allocation schemes, and theoretical guarantees for optimism-based algorithms are known only for restricted formulations. In this paper we study general contextual bandits under the realizability condition and propose a simple generic principle for designing optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). We show that these algorithms are provably optimal and efficient in the presence of large context spaces. Key components of UCCB include: 1) a systematic analysis of confidence bounds in policy space rather than in action space; and 2) a potential-function perspective that expresses the power of optimism in the contextual setting. We further show how the UCCB principle can be extended to infinite action spaces by constructing confidence bounds via the newly introduced notion of "counterfactual action divergence."
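As a schematic illustration of what "optimism in policy space rather than action space" means (the symbols below are illustrative placeholders, not the paper's exact UCCB construction): writing $\Pi$ for the policy class, a generic optimistic rule selects
\[
\pi_t \;=\; \operatorname*{arg\,max}_{\pi \in \Pi}\bigl[\widehat{V}_t(\pi) + \mathrm{width}_t(\pi)\bigr],
\]
where $\widehat{V}_t(\pi)$ is an empirical estimate of the value of policy $\pi$ and $\mathrm{width}_t(\pi)$ is a confidence width attached to the policy as a whole. This contrasts with classical action-space UCB, which, for the observed context $x_t$, plays $a_t = \operatorname*{arg\,max}_{a}\bigl[\widehat{\mu}_t(x_t,a) + \mathrm{bonus}_t(x_t,a)\bigr]$ with a per-action bonus. The quantities $\widehat{V}_t$, $\mathrm{width}_t$, $\widehat{\mu}_t$, and $\mathrm{bonus}_t$ are hypothetical stand-ins; the specific counterfactual confidence bounds used by UCCB are defined in the body of the paper, not in this abstract.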