In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback criterion. While the mean-reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest in critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round, which we can relax by allowing only approximate solutions obtained by online gradient descent, at the cost of a slightly higher regret. We conclude by evaluating the resulting algorithms in numerical experiments.
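For concreteness, we recall the standard definition underlying the expectile example (not restated in the abstract itself): the $\tau$-expectile $e_\tau(Y)$ of a random variable $Y$ is the unique minimizer of an asymmetric least-squares loss,
\[
e_\tau(Y) \;=\; \operatorname*{arg\,min}_{m \in \mathbb{R}} \; \mathbb{E}\!\left[\,\lvert \tau - \mathbf{1}\{Y \le m\}\rvert \,(Y - m)^2\,\right], \qquad \tau \in (0,1),
\]
which recovers the mean at $\tau = 1/2$. In the contextual setting above, the natural working assumption is that this minimizer is linear in the context, $e_\tau(Y \mid x) = x^\top \theta^\star$ for an unknown parameter $\theta^\star$ (the notation $\theta^\star$ is ours, for illustration).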
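The abstract leaves the relaxed per-round update unspecified; the following is a minimal, hypothetical sketch of one online-gradient-descent step on the expectile loss above, assuming the linear model $x^\top\theta$. The function names, the step size lr, and the omission of a projection onto a bounded parameter set are illustrative assumptions, not the paper's algorithm.

import numpy as np

def expectile_grad(theta, x, y, tau):
    """Gradient in theta of the asymmetric squared loss
    |tau - 1{y <= x.theta}| * (y - x.theta)^2 (hypothetical helper)."""
    r = y - x @ theta                 # residual under the linear model
    w = tau if r > 0 else 1.0 - tau   # asymmetric weight from the indicator
    return -2.0 * w * r * x           # chain rule through m = x.theta

def ogd_step(theta, x, y, tau, lr=0.01):
    """One online-gradient-descent update on the latest (x, y) pair,
    replacing the exact per-round convex minimization."""
    return theta - lr * expectile_grad(theta, x, y, tau)

# Toy usage: stream of (context, reward) pairs updating theta online.
rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(1000):
    x = rng.normal(size=3)
    y = x @ np.array([1.0, -0.5, 0.2]) + rng.normal()
    theta = ogd_step(theta, x, y, tau=0.9)

A full algorithm would typically also project each iterate onto a bounded parameter set, as is standard in online-gradient-descent analyses; the sketch omits this for brevity.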