When humans collaborate, they often make decisions by observing others and considering the consequences their actions may have on the entire team, instead of greedily doing what is best for themselves alone. We would like our AI agents to collaborate effectively in a similar way by capturing a model of their partners. In this work, we propose and analyze a decentralized Multi-Armed Bandit (MAB) problem with coupled rewards as an abstraction of more general multi-agent collaboration. We demonstrate that na\"ive extensions of optimal single-agent MAB algorithms fail when applied to decentralized bandit teams. Instead, we propose a partner-aware strategy for joint sequential decision-making that extends the well-known single-agent Upper Confidence Bound algorithm. We analytically show that our proposed strategy achieves logarithmic regret, and we provide extensive experiments involving human-AI and human-robot collaboration to validate our theoretical findings. Our results show that the proposed partner-aware strategy outperforms other known methods, and our human-subject studies suggest that humans prefer to collaborate with AI agents implementing our partner-aware strategy.
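To make the partner-aware idea concrete, below is a minimal, illustrative sketch, not the paper's actual algorithm or experimental setup. It assumes a two-agent team bandit with coupled Bernoulli rewards: agent A runs plain UCB1 over its own arms (the na\"ive single-agent extension), while the partner-aware agent B keeps per-cell estimates of the joint reward and a Laplace-smoothed empirical model of A's choices, then scores its own arms by the joint UCB index under that model. All names and constants (`MU`, `a_freq`, the horizon `T`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3   # arms per agent (illustrative)
T = 5000  # horizon (illustrative)
# Hypothetical coupled mean-reward matrix: entry (i, j) is the expected
# team reward when agent A pulls arm i and agent B pulls arm j.
MU = rng.uniform(0.1, 0.9, size=(K, K))

def ucb_index(means, counts, t):
    """Standard UCB1 index; unexplored entries get +inf so they are tried first."""
    with np.errstate(divide="ignore", invalid="ignore"):
        bonus = np.sqrt(2.0 * np.log(t) / counts)
    idx = means + bonus
    idx[counts == 0] = np.inf
    return idx

# Agent A: plain UCB1 over its own arms, marginalizing out the partner.
a_means, a_counts = np.zeros(K), np.zeros(K)
# Agent B: partner-aware; per-cell joint estimates plus a model of A.
b_means, b_counts = np.zeros((K, K)), np.zeros((K, K))
a_freq = np.ones(K)  # Laplace-smoothed counts of A's past choices

total = 0.0
for t in range(1, T + 1):
    a = int(np.argmax(ucb_index(a_means, a_counts, t)))
    p_a = a_freq / a_freq.sum()                  # B's model of A's arm mixture
    joint_ucb = ucb_index(b_means, b_counts, t)  # (K, K) optimistic estimates
    joint_ucb = np.minimum(joint_ucb, 1.0)       # cap at the max possible reward
    b = int(np.argmax(p_a @ joint_ucb))          # expected index under the model

    r = float(rng.random() < MU[a, b])           # Bernoulli team reward
    total += r

    a_counts[a] += 1
    a_means[a] += (r - a_means[a]) / a_counts[a]
    b_counts[a, b] += 1
    b_means[a, b] += (r - b_means[a, b]) / b_counts[a, b]
    a_freq[a] += 1

print(f"avg team reward: {total / T:.3f}  vs  best joint mean: {MU.max():.3f}")
```

The partner model here is deliberately simple (empirical arm frequencies); the strategy analyzed in the paper, and its logarithmic-regret guarantee, are more involved than this sketch.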