Thompson sampling (TS) has attracted considerable interest in the bandit community. It was introduced in the 1930s but was not theoretically analyzed until recent years. All existing analyses of TS in the combinatorial multi-armed bandit (CMAB) setting require an exact oracle that returns an optimal solution for any input. However, such an oracle is usually unavailable, since many combinatorial optimization problems are NP-hard and only approximation oracles exist. An example by Wang and Chen (2018) shows that TS can fail to learn with an approximation oracle; however, that oracle is artificial and tailored to a specific problem instance. Whether the convergence analysis of TS can be extended beyond the exact oracle in CMAB remains an open question. In this paper, we study this question under the greedy oracle, a common (approximation) oracle with theoretical guarantees for many (offline) combinatorial optimization problems. We provide a problem-dependent regret lower bound of order $\Omega(\log T/\Delta^2)$ to quantify the hardness of TS in solving CMAB problems with the greedy oracle, where $T$ is the time horizon and $\Delta$ is a reward gap. We also provide an almost matching regret upper bound. These are the first theoretical results for TS in CMAB with a common approximation oracle, and they refute the misconception that TS cannot work with approximation oracles.
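The setting the abstract describes can be illustrated concretely. Below is a minimal, hypothetical sketch (not the paper's algorithm) of Thompson sampling in a semi-bandit CMAB with Bernoulli base arms: each round, a posterior sample is drawn per base arm and a greedy oracle builds the super arm by repeatedly adding the arm with the largest sampled value, up to a cardinality budget $k$. For a linear reward, greedy top-$k$ happens to be exact; the paper's interest lies in objectives (e.g., submodular ones) where greedy is only an approximation. All names and parameters here are illustrative assumptions.

```python
import random


def thompson_cmab_greedy(n_arms, k, arm_means, T, seed=0):
    """Illustrative sketch: Thompson sampling for CMAB with a greedy oracle.

    Beta(1,1) priors over Bernoulli base arms; each round, the greedy
    oracle adds arms one at a time by largest sampled mean until k arms
    are chosen (for linear rewards this coincides with exact top-k).
    """
    rng = random.Random(seed)
    alpha = [1] * n_arms  # Beta posterior: 1 + observed successes
    beta = [1] * n_arms   # Beta posterior: 1 + observed failures
    total_reward = 0
    for _ in range(T):
        # Draw one posterior sample per base arm.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # Greedy oracle: pick k arms by descending sampled value.
        chosen = sorted(range(n_arms), key=lambda i: theta[i], reverse=True)[:k]
        # Play the super arm; semi-bandit feedback updates each chosen arm.
        for i in chosen:
            x = 1 if rng.random() < arm_means[i] else 0
            alpha[i] += x
            beta[i] += 1 - x
            total_reward += x
    return total_reward, alpha, beta
```

With a clear gap between the best $k$ arms and the rest, the posterior counts concentrate on the optimal super arm, which is the behavior whose rate the paper's $\Omega(\log T/\Delta^2)$ lower bound and matching upper bound characterize under the greedy oracle.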