We study social learning dynamics in which the agents collectively follow a simple multi-armed bandit protocol. Agents arrive sequentially, choose arms, and receive the associated rewards. Each agent observes the full history (arms and rewards) of the previous agents, and there are no private signals. While collectively the agents face the exploration-exploitation tradeoff, each agent acts myopically, without regard to exploration. Motivating scenarios concern reviews and ratings on online platforms. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals, including the "unbiased" behavior as well as various behavioral biases. While extreme versions of these behaviors correspond to well-known bandit algorithms, we prove that more moderate versions lead to stark exploration failures, and consequently to regret rates that are linear in the number of agents. We provide matching upper bounds on regret by analyzing "moderately optimistic" agents. As a special case of independent interest, we obtain a general result on the failure of the greedy algorithm in multi-armed bandits. To the best of our knowledge, this is the first such result in the literature.
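To make the protocol concrete, below is a minimal simulation sketch in Python. Each arriving agent computes empirical means from the public history and myopically picks the arm maximizing "empirical mean + gamma * confidence width"; gamma = 0 corresponds to greedy "unbiased" agents and large gamma to optimistic (UCB-like) agents. The index form, the parameter name `gamma`, and the Bernoulli reward setup are illustrative assumptions for this sketch, not the paper's exact model.

```python
import numpy as np

def simulate(T=10_000, mu=(0.6, 0.4), gamma=0.0, seed=0):
    """Sketch of the sequential social-learning bandit protocol.

    gamma = 0  -> greedy / "unbiased" agents
    gamma > 0  -> optimistic agents (gamma = 1 is roughly UCB-like)
    gamma < 0  -> pessimistic agents
    Returns the cumulative regret over T agents.
    """
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K)
    sums = np.zeros(K)
    # The first K agents try each arm once so every empirical mean is defined.
    for a in range(K):
        sums[a] += float(rng.random() < mu[a])
        counts[a] += 1
    best = max(mu)
    regret = K * best - sum(mu)
    for _ in range(T - K):
        means = sums / counts                          # empirical means from public history
        width = np.sqrt(2.0 * np.log(T) / counts)      # a standard confidence width
        arm = int(np.argmax(means + gamma * width))    # myopic choice, no deliberate exploration
        reward = float(rng.random() < mu[arm])         # Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        regret += best - mu[arm]
    return regret

# Greedy agents (gamma = 0) can lock onto the inferior arm and incur regret
# linear in the number of agents, while sufficiently optimistic agents explore enough.
print(simulate(gamma=0.0), simulate(gamma=1.0))
```

In this toy setup, averaging over random seeds typically shows the gamma = 0 run accumulating regret proportional to T on a constant fraction of runs, illustrating the exploration failure described above.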