This paper unifies the design and analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem for a class of risk functionals $\rho$ that are continuous and dominant. We prove generalised concentration bounds for these continuous and dominant risk functionals and show that a wide class of popular risk functionals belongs to this class. Using our newly developed analytical toolkit, we analyse the algorithm $\rho$-MTS (for multinomial distributions) and prove that it admits asymptotically optimal regret bounds under CVaR, the proportional hazard risk measure, and other ubiquitous risk measures. More generally, we prove the asymptotic optimality of $\rho$-MTS for Bernoulli distributions for a class of risk measures known as empirical distribution performance measures (EDPMs); this class includes the well-known mean-variance. Numerical simulations show that the regret bounds incurred by our algorithms are reasonably tight vis-\`a-vis algorithm-independent lower bounds.
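To make the risk-averse Thompson sampling idea concrete, the following is a minimal illustrative sketch (not the paper's exact $\rho$-MTS algorithm) of Thompson sampling with a CVaR objective over multinomial reward distributions: each arm keeps a Dirichlet posterior over a finite reward support, a distribution is sampled from each posterior, and the arm whose sample has the highest CVaR (the mean of the worst $\alpha$-fraction of outcomes) is pulled. The function names, the two example arms, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar(support, probs, alpha):
    """CVaR at level alpha of a discrete reward distribution:
    the mean of the worst (lowest-value) alpha-fraction of mass."""
    order = np.argsort(support)
    v, p = np.asarray(support)[order], np.asarray(probs)[order]
    remaining, total = alpha, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, remaining)        # mass taken from this atom
        total += take * vi
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / alpha

def cvar_mts(arms, support, alpha, horizon, rng):
    """Sketch of CVaR-averse Thompson sampling on multinomial arms.
    `arms` holds the true reward pmfs used only to simulate feedback."""
    K = len(arms)
    counts = np.ones((K, len(support)))  # Dirichlet(1, ..., 1) priors
    pulls = np.zeros(K, dtype=int)
    for _ in range(horizon):
        # sample a pmf from each arm's posterior and score it by CVaR
        scores = [cvar(support, rng.dirichlet(counts[a]), alpha)
                  for a in range(K)]
        a = int(np.argmax(scores))
        r = rng.choice(len(support), p=arms[a])  # observed reward index
        counts[a, r] += 1                        # posterior update
        pulls[a] += 1
    return pulls

# Example: a "safe" arm vs a higher-mean but heavy-lower-tail "risky" arm
support = np.array([0.0, 0.5, 1.0])
arms = [np.array([0.05, 0.90, 0.05]),   # concentrated near 0.5
        np.array([0.40, 0.00, 0.60])]   # mean 0.6, but CVaR_0.3 = 0
pulls = cvar_mts(arms, support, alpha=0.3, horizon=2000, rng=rng)
# a CVaR-averse policy should concentrate its pulls on the safe arm
```

Under the true pmfs above, the safe arm has $\mathrm{CVaR}_{0.3} \approx 0.42$ while the risky arm's is $0$, so a risk-averse learner should favour the safe arm even though the risky arm has the higher mean; a risk-neutral Thompson sampler would do the opposite.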