The problem of two-sided matching markets is well-studied in computer science and economics, owing to its diverse applications across numerous domains. Since market participants are usually uncertain about their preferences in various online matching platforms, an emerging line of research is dedicated to the online setting where one-side participants (players) learn their unknown preferences through multiple rounds of interactions with the other side (arms). Sankararaman et al. provide an $Ω\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$ regret lower bound for this problem under serial dictatorship assumption, where $N$ is the number of players, $K (\geq N)$ is the number of arms, $Δ$ is the minimum reward gap across players and arms, and $T$ is the time horizon. Serial dictatorship assumes arms have the same preferences, which is common in reality when one side participants have a unified evaluation standard. Recently, the work of Kong and Li proposes the ET-GS algorithm and achieves an $O\left( \frac{K\log(T)}{Δ^2} \right)$ regret upper bound, which is the best upper bound attained so far. Nonetheless, a gap between the lower and upper bounds, ranging from $N$ to $K$, persists. It remains unclear whether the lower bound or the upper bound needs to be improved. In this paper, we propose a multi-level successive selection algorithm that obtains an $O\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$ regret bound when the market satisfies serial dictatorship. To the best of our knowledge, we are the first to propose an algorithm that matches the lower bound in the problem of matching markets with bandits.
翻译:双边匹配市场问题因其在众多领域的广泛应用,在计算机科学和经济学中得到了深入研究。由于市场参与者通常在各类在线匹配平台上对其偏好存在不确定性,一个新兴的研究方向致力于在线场景,其中一方参与者(玩家)通过与另一方(臂)的多轮交互来学习其未知偏好。Sankararaman等人基于序列独裁假设,为该问题提供了$Ω\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$的遗憾下界,其中$N$为玩家数量,$K (\geq N)$为臂的数量,$Δ$为玩家与臂之间的最小奖励差距,$T$为时间范围。序列独裁假设臂具有相同的偏好,这在现实中当一方参与者拥有统一评估标准时是常见的。最近,Kong和Li的工作提出了ET-GS算法,实现了$O\left( \frac{K\log(T)}{Δ^2} \right)$的遗憾上界,这是目前获得的最佳上界。然而,下界与上界之间仍存在从$N$到$K$的差距。目前尚不清楚是需要改进下界还是上界。在本文中,我们提出了一种多级连续选择算法,当市场满足序列独裁条件时,该算法获得了$O\left( \frac{N\log(T)}{Δ^2} + \frac{K\log(T)}Δ \right)$的遗憾界。据我们所知,我们首次提出了在匹配市场赌博机问题中与下界匹配的算法。