We consider stochastic bandit problems with $K$ arms, each associated with a bounded distribution supported on the range $[m,M]$. We do not assume that the range $[m,M]$ is known and show that there is a cost for learning it. Indeed, a new trade-off between distribution-dependent and distribution-free regret bounds arises, which prevents one from simultaneously achieving the typical $\ln T$ and $\sqrt{T}$ bounds. For instance, a $\sqrt{T}$ distribution-free regret bound may only be achieved if the distribution-dependent regret bounds are at least of order $\sqrt{T}$. We exhibit a strategy achieving the rates for regret indicated by this new trade-off.
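The trade-off stated above can be sketched in display form. Here $R_T(\nu)$ is hypothetical notation (not from the abstract) for the expected regret after $T$ rounds on a bandit problem $\nu$ with distributions supported in $[m,M]$; the implication below merely restates the $\sqrt{T}$ case mentioned in the text, not the general result proved in the paper.

```latex
% Schematic statement of the trade-off (notation R_T(\nu) is illustrative):
% a worst-case (distribution-free) bound of order \sqrt{T} ...
\sup_{\nu} R_T(\nu) \le C\sqrt{T}
% ... forces the per-instance (distribution-dependent) regret to also be
% of order at least \sqrt{T} on some problems, ruling out the usual \ln T rate:
\quad\Longrightarrow\quad
R_T(\nu) = \Omega\bigl(\sqrt{T}\bigr) \text{ for some problems } \nu .
```

In the standard setting with a known range, one can have both $R_T(\nu) = O(\ln T)$ per instance and $\sup_\nu R_T(\nu) = O(\sqrt{T})$ in the worst case; the abstract's point is that these two guarantees become incompatible when $[m,M]$ must be learned.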