We consider a stochastic multi-armed bandit setting in which feedback is limited by a (possibly time-dependent) budget, and rewards must be actively queried in order to be observed. Previous works on this setting assumed a strict feedback budget and focused on not violating this constraint while providing problem-independent regret guarantees. In this work, we provide problem-dependent guarantees on both the regret and the queried feedback. In particular, we derive problem-dependent lower bounds on the required feedback and show that there is a fundamental difference between problems with a unique optimal arm and those with multiple optimal arms. Furthermore, we present a new algorithm called BuFALU, for which we derive problem-dependent regret and cumulative feedback bounds. Notably, we show that BuFALU naturally adapts to the number of optimal arms.