In sequential decision-making problems such as the multi-armed bandit, an agent learns by optimizing a feedback criterion. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round, which we relax by allowing only approximate solutions obtained by online gradient descent, at the cost of a slightly higher regret. We conclude by evaluating the resulting algorithms in numerical experiments.
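For concreteness, the expectile at level \tau \in (0,1) of a random variable X is the unique minimizer of an asymmetric squared loss (this is the standard textbook definition, sketched here for illustration rather than quoted from the paper):

\[
e_\tau(X) \;=\; \arg\min_{\theta \in \mathbb{R}} \; \mathbb{E}\big[\, |\tau - \mathbf{1}\{X \le \theta\}| \,(X - \theta)^2 \,\big],
\]

so that \tau = 1/2 recovers the mean, while \tau < 1/2 puts more weight on adverse (low) outcomes.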
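As a rough sketch of the online relaxation mentioned above (the function names, learning rate, and projection radius are our own illustrative choices, not the paper's), one projected online-gradient step per round on the asymmetric squared loss of a linear expectile model could look like:

    import numpy as np

    def expectile_grad(theta, x, r, tau):
        # Gradient of the asymmetric squared loss
        # |tau - 1{r <= <x, theta>}| * (r - <x, theta>)^2
        # for a linear expectile model r ~ <x, theta>.
        err = r - x @ theta
        w = tau if err > 0 else 1.0 - tau  # asymmetric weight
        return -2.0 * w * err * x

    def ogd_step(theta, x, r, tau, lr=0.05, radius=1.0):
        # One projected online-gradient step: cheaper than
        # re-solving the convex problem from scratch each round.
        theta = theta - lr * expectile_grad(theta, x, r, tau)
        norm = np.linalg.norm(theta)
        if norm > radius:  # project back onto the l2 ball
            theta *= radius / norm
        return theta

Running one such step per observed context-reward pair replaces the exact per-round convex solve, which is the source of the slightly higher regret mentioned above.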