In this paper, we consider stochastic multi-armed bandits (MABs) with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_{p}$ for $1<p\leq2$. First, we propose a novel robust estimator that does not require $\nu_{p}$ as prior information, whereas existing robust estimators demand prior knowledge of $\nu_{p}$. We show that the error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative distribution function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations. We also find the optimal hyperparameters for each perturbation, which achieve the minimax optimal regret bound with respect to the total number of rounds. In simulations, the proposed estimator shows favorable performance compared to existing robust estimators for various $p$ values and, for MAB problems, the proposed perturbation strategy outperforms existing exploration methods.
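As a point of reference for the contrast drawn above, the following is a minimal sketch (not the paper's estimator) of a classical truncation-based robust mean estimator for heavy-tailed samples with bounded $p$-th moment. Note that it takes $\nu_p$ and a confidence level $\delta$ as inputs, illustrating exactly the prior knowledge that the proposed estimator dispenses with; the truncation-level formula is one standard choice and is assumed here for illustration.

```python
import numpy as np

def truncated_mean(rewards, p, nu_p, delta=0.05):
    """Truncated empirical mean for heavy-tailed samples.

    Assumes E|X|^p <= nu_p for some 1 < p <= 2. Each sample X_i is
    zeroed out when it exceeds a truncation level that grows with i,
    a standard construction for heavy-tailed mean estimation. Unlike
    the estimator proposed in the paper, this requires nu_p upfront.
    """
    n = len(rewards)
    idx = np.arange(1, n + 1)
    # Truncation level B_i = (nu_p * i / log(1/delta))^{1/p}
    thresholds = (nu_p * idx / np.log(1.0 / delta)) ** (1.0 / p)
    clipped = np.where(np.abs(rewards) <= thresholds, rewards, 0.0)
    return clipped.mean()

# Heavy-tailed samples: a Pareto/Lomax law with tail index 1.5 has a
# finite p-th moment only for p < 1.5, so p = 1.2 is admissible.
rng = np.random.default_rng(0)
samples = rng.pareto(1.5, size=10_000)
print(truncated_mean(samples, p=1.2, nu_p=10.0))
```

Because extreme samples are clipped rather than averaged in, the estimate concentrates around the true mean even when the empirical mean has heavy-tailed fluctuations; the cost is the required knowledge of $\nu_p$, which motivates the estimator proposed in the paper.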