Consider a bandit learning environment. We demonstrate that popular learning algorithms such as Upper Confidence Bound (UCB) and $\varepsilon$-Greedy exhibit risk aversion: when presented with two arms of the same expectation but different variances, the algorithms tend not to choose the riskier, i.e. higher-variance, arm. We prove that $\varepsilon$-Greedy chooses the risky arm with probability tending to $0$ when faced with a deterministic arm and a Rademacher-distributed arm. We show experimentally that UCB also exhibits risk-averse behavior, and that risk aversion persists in early rounds of learning even when the riskier arm has a slightly higher expectation. We calibrate our model to a recommendation system and show that algorithmic risk aversion can decrease consumer surplus and increase homogeneity. We discuss several extensions to other bandit algorithms and to reinforcement learning, and investigate the implications of algorithmic risk aversion for decision theory.
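To make the setup concrete, below is a minimal simulation sketch (not from the paper) of $\varepsilon$-Greedy facing a deterministic arm and a Rademacher arm with the same mean. The horizon, the fixed exploration rate, and the reward values ($0$ for the safe arm, $\pm 1$ for the risky arm) are illustrative assumptions; the paper's asymptotic statement may rely on a different exploration schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def safe_arm():
    # Deterministic arm: always 0 (mean 0, zero variance).
    return 0.0

def risky_arm():
    # Rademacher arm: +1 or -1 with equal probability (mean 0, variance 1).
    return rng.choice([-1.0, 1.0])

def epsilon_greedy_share(horizon=10_000, epsilon=0.1):
    """Run epsilon-Greedy on the two-arm instance and return the
    fraction of pulls given to the risky (Rademacher) arm."""
    arms = [safe_arm, risky_arm]
    counts = np.zeros(2)
    means = np.zeros(2)
    risky_pulls = 0
    for _ in range(horizon):
        if rng.random() < epsilon or counts.min() == 0:
            a = int(rng.integers(2))      # explore uniformly at random
        else:
            a = int(np.argmax(means))     # exploit the higher empirical mean
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        risky_pulls += (a == 1)
    return risky_pulls / horizon

# Averaged over independent runs, the risky arm's share of pulls stays
# noticeably below 1/2 despite the equal means, consistent with the
# risk-averse behavior described above (illustrative, not a proof).
for eps in (0.2, 0.1, 0.05):
    share = np.mean([epsilon_greedy_share(epsilon=eps) for _ in range(50)])
    print(f"epsilon={eps:.2f}: risky-arm share ~ {share:.3f}")
```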