In this paper, we study the contextual multinomial logit (MNL) bandit problem, in which a learning agent sequentially selects an assortment based on contextual information and user feedback follows an MNL choice model. A significant gap has persisted between the known lower and upper regret bounds, particularly with respect to the maximum assortment size $K$. Moreover, these bounds have been derived under differing reward structures, which further complicates establishing optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{T/K})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{T/K})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achieved by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- in either the uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that attains this optimality up to logarithmic factors.
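For concreteness, here is a minimal sketch of the choice model referenced above, written in the standard notation of the contextual MNL bandit literature; the symbols $S_t$, $x_{ti}$, $\theta^\star$, and $r_{ti}$ are conventional and are not defined in the abstract itself. At round $t$, given a selected assortment $S_t$ of at most $K$ items with feature vectors $x_{ti} \in \mathbb{R}^d$ and an unknown parameter $\theta^\star \in \mathbb{R}^d$, the user chooses item $i \in S_t$ (or the outside "no-purchase" option, denoted $i = 0$) with probability
\[
p_t(i \mid S_t, \theta^\star) \;=\; \frac{\exp\left(x_{ti}^\top \theta^\star\right)}{1 + \sum_{j \in S_t} \exp\left(x_{tj}^\top \theta^\star\right)},
\]
and the cumulative regret over $T$ rounds compares the expected revenue $R_t(S, \theta^\star) = \sum_{i \in S} r_{ti}\, p_t(i \mid S, \theta^\star)$ of the chosen assortments against that of the per-round optimal assortments. In this notation, the uniform-reward setting above corresponds to the expected rewards $r_{ti}$ being equal across all items.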