We consider risk-averse learning in repeated unknown games where the goal of the agents is to minimize their individual risk of incurring significantly high costs. Specifically, the agents use the conditional value at risk (CVaR) as a risk measure and rely on bandit feedback, in the form of the cost values of the selected actions at every episode, to estimate their CVaR values and update their actions. A major challenge in using bandit feedback to estimate CVaR is that the agents can only access their own cost values, which, however, depend on the actions of all agents. To address this challenge, we propose a new risk-averse learning algorithm with momentum that utilizes the full historical information on the cost values. We show that this algorithm achieves sub-linear regret, matching the best-known algorithms in the literature. We provide numerical experiments on a Cournot game showing that our method outperforms existing methods.
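To make the CVaR-from-bandit-feedback idea concrete, the following is a minimal sketch (not the paper's exact algorithm) of how a single agent could form an empirical CVaR estimate from its own observed cost samples and smooth it with a momentum term; the tail level `alpha`, the momentum parameter `beta`, and the placeholder cost distribution are all illustrative assumptions.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.1):
    """Empirical CVaR_alpha of observed costs: mean of the worst alpha-fraction.

    Generic sample-based estimator, used here only for illustration.
    """
    costs = np.sort(np.asarray(costs, dtype=float))[::-1]  # sort descending (worst first)
    k = max(1, int(np.ceil(alpha * len(costs))))            # size of the alpha-tail
    return costs[:k].mean()

rng = np.random.default_rng(0)
beta = 0.9           # momentum parameter (assumed value, not from the paper)
cvar_momentum = 0.0
history = []         # full historical cost information of this agent

for episode in range(1, 201):
    # Placeholder cost draw; in a game this cost would depend on all agents' actions.
    cost = rng.normal(loc=1.0, scale=0.5)
    history.append(cost)

    # Full-history empirical CVaR estimate from bandit feedback.
    cvar_now = empirical_cvar(history, alpha=0.1)

    # Momentum-style smoothing of the estimate across episodes.
    cvar_momentum = beta * cvar_momentum + (1 - beta) * cvar_now

print(f"full-history CVaR estimate:   {empirical_cvar(history, 0.1):.3f}")
print(f"momentum-smoothed estimate:   {cvar_momentum:.3f}")
```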