Online learning in a two-sided matching market, where demand-side agents continuously compete to be matched with the supply side (arms), abstracts the complex interactions under partial information on matching platforms (e.g., UpWork, TaskRabbit). We study the decentralized serial dictatorship setting: a two-sided matching market in which the demand-side agents have unknown and heterogeneous valuations over the supply side (arms), while the arms have known, uniform preferences over the demand side (agents). We design UCB with Decentralized Dominant-arm Deletion (UCB-D3), the first decentralized algorithm for the agents that requires no knowledge of the reward gaps or the time horizon. UCB-D3 works in phases. In each phase, agents delete \emph{dominated arms}, i.e., the arms preferred by higher-ranked agents, and play only among the non-dominated arms according to UCB. At the end of each phase, agents broadcast their estimated preferred arms in a decentralized fashion through {\em pure exploitation}. We prove both a new regret lower bound for the decentralized serial dictatorship model and that UCB-D3 is order optimal.
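The phase structure described above can be illustrated with a minimal, centralized simulation. This is a sketch of the general idea (dominated-arm deletion followed by UCB on the remaining arms), not the authors' UCB-D3. It rests on assumptions that are not from the source: Bernoulli rewards, doubling phase lengths `base_len * 2**phase`, an agent that loses a collision observes nothing, and the end-of-phase broadcast is replaced by directly sharing each agent's estimated preferred arm. The abstract says UCB-D3 performs this broadcast through pure exploitation, which is not modeled here. The function name `ucb_d3_sketch` and all of its parameters are hypothetical.

```python
import math
import random


def ucb_d3_sketch(mu, num_phases=8, base_len=100, seed=0):
    """Hypothetical, simplified simulation of phase-based dominated-arm deletion + UCB.

    mu[i][a]: mean Bernoulli reward of arm a for agent i (requires #arms >= #agents).
    Agent 0 has the highest rank: under serial dictatorship every arm prefers
    lower-index agents. Assumptions (not from the source): doubling phase lengths,
    a collided agent observes nothing, and the end-of-phase "broadcast" is direct
    sharing rather than signalling through exploitation.
    Returns the arm each agent announces after the final phase.
    """
    rng = random.Random(seed)
    n, k = len(mu), len(mu[0])
    counts = [[0] * k for _ in range(n)]
    sums = [[0.0] * k for _ in range(n)]
    steps = [0] * n  # per-agent round counter used in the UCB bonus
    dominated = [set() for _ in range(n)]  # arms claimed by higher-ranked agents
    announced = [None] * n

    def ucb_index(i, a):
        if counts[i][a] == 0:
            return math.inf  # keep trying until one successful (uncollided) sample
        mean = sums[i][a] / counts[i][a]
        return mean + math.sqrt(2 * math.log(steps[i] + 1) / counts[i][a])

    for phase in range(num_phases):
        for _ in range(base_len * 2 ** phase):
            # Each agent runs UCB over its non-dominated arms only.
            pulls = []
            for i in range(n):
                steps[i] += 1
                allowed = [a for a in range(k) if a not in dominated[i]]
                pulls.append(max(allowed, key=lambda a: ucb_index(i, a)))
            # Collision resolution: each arm accepts its highest-ranked
            # (lowest-index) proposer; the others observe no reward.
            winner = {}
            for i, a in enumerate(pulls):
                winner.setdefault(a, i)
            for a, i in winner.items():
                reward = 1.0 if rng.random() < mu[i][a] else 0.0
                counts[i][a] += 1
                sums[i][a] += reward
        # End of phase: agents announce in rank order; each picks its empirically
        # best arm among those not yet claimed by a higher-ranked agent.
        claimed = set()
        for i in range(n):
            free = [a for a in range(k) if a not in claimed]
            announced[i] = max(
                free,
                key=lambda a: sums[i][a] / counts[i][a] if counts[i][a] else -math.inf,
            )
            dominated[i] = set(claimed)  # arms this agent deletes in the next phase
            claimed.add(announced[i])
    return announced


# Three agents, four arms; agent 0 has the highest rank.
mu = [
    [0.9, 0.8, 0.2, 0.1],
    [0.9, 0.3, 0.8, 0.1],
    [0.9, 0.85, 0.7, 0.2],
]
print(ucb_d3_sketch(mu))
```

With these means, the serial-dictatorship stable matching gives arm 0 to agent 0, arm 2 to agent 1 (its best arm once arm 0 is taken) and arm 1 to agent 2. With reward gaps this large, the sketch's final announcements should recover that matching. Announcing in rank order mirrors the serial dictatorship structure: a lower-ranked agent only needs the higher-ranked agents' choices to know which arms are dominated.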