Rate-Optimal Contextual Online Matching Bandit

Two-sided online matching platforms are used in a wide range of markets. However, agents' preferences in these markets are usually implicit and unknown, and must be learned from data. With the growing availability of side information in the decision process, modern online matching methods need to track how agents' preferences change with their contextual information. This motivates us to consider a novel Contextual Online Matching Bandit prOblem (COMBO), which allows dynamic preferences in matching decisions. Existing work focuses on multi-armed bandits with static preferences, but this is insufficient: the two-sided preferences change whenever one side's contextual information updates, resulting in non-static matching. In this paper, we propose a Centralized Contextual - Explore Then Commit (CC-ETC) algorithm for COMBO, which solves online matching with dynamic preferences. In theory, we show that CC-ETC achieves a sublinear regret upper bound of O(log(T)) and is rate-optimal, which we establish by proving a matching lower bound. In experiments, we demonstrate that CC-ETC is robust to different preference schemes, context dimensions, reward noise levels, and levels of context variation.
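
The abstract does not spell out the steps of CC-ETC, so the following is only a minimal sketch of a generic centralized explore-then-commit scheme for contextual matching, not the authors' algorithm. Everything specific in it is an assumption made for illustration: rewards are linear in arm contexts, exploration is round-robin, each agent's parameter is estimated by ridge regression, arms hold known fixed rankings of agents, and the commit phase runs agent-proposing Gale-Shapley on the estimated preferences. The names `gale_shapley` and `explore_then_commit` are hypothetical.

```python
import numpy as np


def gale_shapley(agent_prefs, arm_ranks):
    """Agent-proposing deferred acceptance (requires n_arms >= n_agents).

    agent_prefs[i]: arms ordered from most to least preferred by agent i.
    arm_ranks[j][i]: arm j's rank of agent i (lower is better).
    Returns a list `match` with match[i] = arm assigned to agent i.
    """
    n_agents = len(agent_prefs)
    next_choice = [0] * n_agents  # next position in each agent's list
    holder = {}                   # arm -> agent it currently holds
    free = list(range(n_agents))
    while free:
        i = free.pop()
        j = agent_prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i
        elif arm_ranks[j][i] < arm_ranks[j][holder[j]]:
            free.append(holder[j])  # arm j trades up; old agent is free again
            holder[j] = i
        else:
            free.append(i)          # proposal rejected
    match = [None] * n_agents
    for j, i in holder.items():
        match[i] = j
    return match


def explore_then_commit(theta, arm_ranks, horizon, explore_rounds,
                        noise=0.1, ridge=1.0, seed=0):
    """Assumed model: agent i's expected reward on arm j is theta[i] @ x_j(t)."""
    rng = np.random.default_rng(seed)
    n_agents, dim = theta.shape
    n_arms = len(arm_ranks)
    gram = np.stack([ridge * np.eye(dim) for _ in range(n_agents)])
    moment = np.zeros((n_agents, dim))
    theta_hat = np.zeros((n_agents, dim))
    matches = []
    for t in range(horizon):
        contexts = rng.normal(size=(n_arms, dim))  # arm contexts change each round
        if t < explore_rounds:
            # Explore: round-robin assignment, conflict-free when n_arms >= n_agents.
            match = [(i + t) % n_arms for i in range(n_agents)]
            for i, j in enumerate(match):
                x = contexts[j]
                reward = theta[i] @ x + noise * rng.normal()
                gram[i] += np.outer(x, x)
                moment[i] += reward * x
        else:
            if t == explore_rounds:
                # Ridge estimate of each agent's parameter, computed once.
                theta_hat = np.stack([np.linalg.solve(gram[i], moment[i])
                                      for i in range(n_agents)])
            # Commit: rank arms by estimated reward under the current contexts.
            scores = theta_hat @ contexts.T
            agent_prefs = np.argsort(-scores, axis=1).tolist()
            match = gale_shapley(agent_prefs, arm_ranks)
        matches.append(match)
    return theta_hat, matches


# Toy instance: 3 agents, 4 arms, 2-dimensional contexts (all values made up).
theta_true = np.array([[1.0, -0.5], [0.3, 0.8], [-0.7, 0.2]])
arm_ranks = [[0, 1, 2], [1, 2, 0], [2, 0, 1], [0, 1, 2]]
theta_hat, matches = explore_then_commit(theta_true, arm_ranks,
                                         horizon=210, explore_rounds=200)
```

In this sketch the matching in the commit phase is recomputed every round because the contexts, and therefore the estimated preferences, change every round; only the parameter estimates are frozen after exploration.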
