We consider the two-sided matching market with bandit learners. In the standard matching problem, users and providers are matched to ensure incentive compatibility via the notion of stability. However, contrary to the core assumption of the matching problem, users and providers do not know their true preferences a priori and must learn them. To address this assumption, recent works propose to blend the matching and multi-armed bandit problems. They establish that it is possible to assign matchings that are stable (i.e., incentive-compatible) at every time step while also allowing agents to learn enough so that the system converges to matchings that are stable under the agents' true preferences. However, while some agents may incur low regret under these matchings, others can incur high regret -- specifically, $\Omega(T)$ optimal regret where $T$ is the time horizon. In this work, we incorporate costs and transfers in the two-sided matching market with bandit learners in order to faithfully model competition between agents. We prove that, under our framework, it is possible to simultaneously guarantee four desiderata: (1) incentive compatibility, i.e., stability, (2) low regret, i.e., $O(\log(T))$ optimal regret, (3) fairness in the distribution of regret among agents, and (4) high social welfare.
翻译:我们考虑的是与土匪的双面匹配市场。 在标准匹配问题中,用户和供应商被匹配以确保通过稳定概念的激励兼容性。 但是,与匹配问题的核心假设相反,用户和供应商并不先验地知道他们真正的偏好,必须学习他们。 为了解决这一假设,最近的工作提议将匹配和多武装土匪问题混为一谈。他们确定,可以每一步都指定稳定(即激励与兼容)的匹配,同时允许代理商充分学习,以使系统与在代理商真正偏好下稳定的匹配相匹配。然而,虽然一些代理商可能对这些匹配产生低的遗憾,但其他人可能会感到非常遗憾 -- -- 具体地说,当美元为时空时,用美元(T)美元(T)最优的遗憾。在双面匹配市场中,我们将成本和转账纳入土匪学习者,以便忠实地模拟代理商之间的竞争。我们证明,在我们的框架下,可以同时保证四边匹配:(1) 激励兼容性,即稳定,(2) 低遗憾,高额分配(I. ) 和高额 。