In this work we consider the problem of regret minimization for logistic bandits. The main challenge in logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$ that can, in the worst case, scale exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille et al. (2021) exploited the self-concordance of the logistic function to remove this worst-case dependence, providing regret guarantees of the form $O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$, where $d$ is the dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the best arm. This work improves upon this bound in the fixed-arm setting by employing an experimental design procedure that achieves a minimax regret of $O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. Our analysis in fact yields a tighter instance-dependent (i.e., gap-dependent) regret bound, the first of its kind for logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower-order term in the regret in general, and we prove that for some instances it replaces the dependence of the lower-order term on $\kappa$ with $\log^2(\kappa)$. Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem, providing an example where the $d^2$ lower-order regret (compared with $d$ for linear bandits) may not be improvable as long as the MLE is used, and showing how bias-corrected estimators may be used to bring it closer to $d$.
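To make the exponential scaling of $\kappa$ concrete, the following is a minimal sketch rather than the paper's own code. It assumes the definition $\kappa = \max_{x \in \mathcal{X}} 1/\dot\mu(x^\top\theta_{\ast})$ that is common in the logistic bandit literature, where $\dot\mu(z) = \mu(z)(1-\mu(z))$ is the derivative of the logistic function; the abstract does not state the definition, and the function names `logistic_derivative` and `kappa` are hypothetical.

```python
import math


def logistic_derivative(z):
    """Derivative of the logistic function, mu(z) * (1 - mu(z)).

    The derivative is even in z, so we evaluate it at -|z| to avoid
    overflow in exp for large |z|.
    """
    e = math.exp(-abs(z))
    return e / (1.0 + e) ** 2


def kappa(arms, theta):
    """Assumed definition: max over arms x of 1 / mu_dot(<x, theta>)."""
    return max(
        1.0 / logistic_derivative(sum(xi * ti for xi, ti in zip(x, theta)))
        for x in arms
    )


if __name__ == "__main__":
    # A single unit-norm arm aligned with theta, so <x, theta> = ||theta||.
    arms = [[1.0, 0.0]]
    for norm in (1.0, 5.0, 10.0):
        # Under this definition, kappa = 2 + 2*cosh(norm), i.e. about e^norm.
        print(norm, kappa(arms, [norm, 0.0]))
```

Under this assumed definition, $1/\dot\mu(z) = 2 + 2\cosh(z)$, so an arm aligned with $\theta_{\ast}$ gives a $\kappa$ that grows like $e^{\|\theta_{\ast}\|}$. This is why a bound that depends on $\kappa$ only through $\log^2(\kappa)$, or only in a lower-order term, matters.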