In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, motivated by certain operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant compared to the traditional objective) and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$. We then design an adaptive explore-then-commit policy whose exploration is driven by appropriately tuned confidence bounds on the mean rewards and whose stopping criterion adapts to the problem difficulty; this policy achieves these bounds (up to logarithmic factors). We then generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top $m$ arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.
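To make the policy description above concrete, the following is a minimal illustrative sketch (in Python) of an explore-then-commit policy of this flavor: round-robin exploration with Hoeffding-style confidence bounds, an adaptive stopping rule, and then commitment of all remaining pulls to the apparent best arm so that its total reward, the quantity valued by the "max" objective, is maximized. The function name, the confidence-radius tuning, and the parameter `delta` are illustrative assumptions, not the paper's exact policy or constants.

```python
import numpy as np

def max_objective_etc(arms, T, delta=0.05):
    """Illustrative explore-then-commit sketch for the 'max' objective.

    arms : list of callables, each returning one stochastic reward sample
    T    : total number of pulls
    Returns the highest total reward accumulated on any single arm.
    """
    K = len(arms)
    totals = np.zeros(K)               # per-arm cumulative reward
    counts = np.zeros(K, dtype=int)    # per-arm pull counts
    committed = None                   # arm index once exploration stops

    for t in range(T):
        i = committed if committed is not None else t % K   # round-robin explore
        totals[i] += arms[i]()
        counts[i] += 1

        if committed is None and counts.min() > 0:
            means = totals / counts
            radius = np.sqrt(np.log(T / delta) / (2 * counts))  # Hoeffding-style radius (assumed tuning)
            lcb, ucb = means - radius, means + radius
            best = int(np.argmax(lcb))
            others = np.delete(ucb, best)
            # Adaptive stop: commit once `best` plausibly dominates every other arm.
            if others.size == 0 or lcb[best] >= others.max():
                committed = best

    return totals.max()   # the 'max' objective values only the best single arm's total
```

For instance, `arms = [lambda p=p: float(np.random.random() < p) for p in (0.3, 0.5, 0.7)]` gives three Bernoulli arms; the sketch then spends the bulk of the $T$ pulls on whichever arm it commits to, which is exactly what the "max" objective rewards.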