Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially observable Markov decision processes in which the agent conditions on the observed history. Over the past decade, considerable effort has gone into designing online optimization methods for TFSDM. Virtually all of that work has been in the full-feedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where no counterfactual information is available, even though that setting has been well understood for almost 20 years in one-shot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) $O(\sqrt{T})$ cumulative regret in expectation compared to any fixed strategy, at all times $T$. This is made possible by new results that we derive, which may be of independent use: (1) the geometry of the dilated entropy regularizer, (2) the autocorrelation matrix of the natural sampling scheme for sequence-form strategies, (3) the construction of an unbiased estimator of linear losses for sequence-form strategies, and (4) a refined regret analysis of mirror descent with the dilated entropy regularizer.
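To make the terms "sequence-form strategy" and "autocorrelation matrix" concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that the natural sampling scheme means drawing one action independently at every decision node according to the agent's behavioral strategy and playing the resulting pure strategy. The toy tree `TREE`, the node names `J0`, `J1`, `J2`, and all helper functions are hypothetical. Moments are computed by exhaustive enumeration, which is exponential in the tree size and is used here only to keep the arithmetic exact.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy tree: decision node -> (parent sequence, actions).
# A parent of None means the node is reached directly from the root.
# Parents must be listed before their children.
TREE = {
    "J0": (None, ["a", "b"]),
    "J1": (("J0", "a"), ["c", "d"]),
    "J2": (("J0", "a"), ["e", "f"]),
}
SEQUENCES = [(j, a) for j, (_, actions) in TREE.items() for a in actions]


def sequence_form(behavioral):
    """Convert per-node action probabilities into sequence-form weights."""
    x = {}
    for j, (parent, actions) in TREE.items():
        reach = F(1) if parent is None else x[parent]
        for a in actions:
            x[(j, a)] = reach * behavioral[j][a]
    return x


def pure_strategies(behavioral):
    """Yield (probability, pure strategy) for independent per-node draws."""
    nodes = list(TREE)
    for choice in product(*(TREE[j][1] for j in nodes)):
        prob = F(1)
        pure = {}
        for j, chosen in zip(nodes, choice):
            prob *= behavioral[j][chosen]
            pure[j] = {a: F(int(a == chosen)) for a in TREE[j][1]}
        yield prob, pure


def moments(behavioral):
    """Exact mean E[y] and autocorrelation E[y y^T] of the sampled vector y."""
    mean = {s: F(0) for s in SEQUENCES}
    autocorr = {(s, t): F(0) for s in SEQUENCES for t in SEQUENCES}
    for prob, pure in pure_strategies(behavioral):
        y = sequence_form(pure)  # 0/1 sequence-form vector of the pure strategy
        for s in SEQUENCES:
            mean[s] += prob * y[s]
            for t in SEQUENCES:
                autocorr[(s, t)] += prob * y[s] * y[t]
    return mean, autocorr


# Assumed behavioral strategy, chosen only for illustration.
behavioral = {
    "J0": {"a": F(1, 2), "b": F(1, 2)},
    "J1": {"c": F(1, 4), "d": F(3, 4)},
    "J2": {"e": F(1, 3), "f": F(2, 3)},
}
x = sequence_form(behavioral)
mean, autocorr = moments(behavioral)
```

Under these assumptions the sampled vector is unbiased for the sequence-form strategy, so `mean` equals `x`. Because each entry of the sampled vector is 0 or 1, the diagonal of `autocorr` also equals `x`, and the entry for a sequence paired with one of its ancestors equals the weight of the deeper sequence.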