The multi-armed bandit (MAB) problem is a ubiquitous decision-making problem that exemplifies the exploration-exploitation tradeoff. Standard formulations exclude risk in decision making. Risk notably complicates the basic reward-maximising objective, in part because there is no universally agreed definition of it. In this paper, we consider an entropic risk (ER) measure and explore the performance of a Thompson sampling-based algorithm, ERTS, under this risk measure by providing regret bounds for ERTS and corresponding instance-dependent lower bounds.
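As background, the entropic risk measure has a standard closed form: for risk parameter \(\lambda \neq 0\), \(\mathrm{ER}_\lambda(X) = \frac{1}{\lambda}\log \mathbb{E}[e^{\lambda X}]\). The sketch below is a minimal illustration of estimating it from samples; the function name, the sign convention for \(\lambda\), and the Gaussian test case are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def entropic_risk(samples, lam):
    # Empirical entropic risk of a reward distribution X:
    #   ER_lam(X) = (1/lam) * log E[exp(lam * X)],  lam != 0.
    # Under this convention, lam < 0 penalises variability in rewards
    # (risk-averse), while lam -> 0 recovers the plain mean.
    samples = np.asarray(samples, dtype=float)
    return np.log(np.mean(np.exp(lam * samples))) / lam

# Sanity check against the Gaussian closed form: for X ~ N(mu, sigma^2),
# ER_lam(X) = mu + lam * sigma^2 / 2.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
print(entropic_risk(x, -0.5))  # close to 1.0 + (-0.5) * 4 / 2 = 0.0
print(entropic_risk(x, 0.5))   # close to 1.0 + 0.5 * 4 / 2 = 2.0
```

The Gaussian identity above makes the exploration-exploitation difficulty concrete: under entropic risk, an arm's quality depends on higher moments of its reward distribution, not just its mean, which is what a risk-aware algorithm such as ERTS must estimate.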