We consider a sequential assortment selection problem in which user choice follows a multinomial logit (MNL) choice model with unknown parameters. In each period, the learning agent observes $d$-dimensional contextual information about the user and the $N$ available items, offers the user an assortment of size $K$, and observes bandit feedback in the form of the item chosen from the assortment. We propose upper confidence bound based algorithms for this MNL contextual bandit. The first algorithm is a simple and practical method that achieves $\tilde{\mathcal{O}}(d\sqrt{T})$ regret over $T$ rounds. The second algorithm achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret, which matches the lower bound for the MNL bandit problem up to logarithmic terms and improves on the best known result by a $\sqrt{d}$ factor. To establish this sharper regret bound, we present a non-asymptotic confidence bound for the maximum likelihood estimator of the MNL model, which may be of independent interest as a theoretical contribution in its own right. We then revisit the simpler and significantly more practical first algorithm and show that a simple variant of it achieves the optimal regret for a broad class of important applications.
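For readers unfamiliar with the MNL choice model referred to in the abstract, the following minimal sketch illustrates how choice probabilities over an offered assortment could be computed from contextual features. It is not the paper's algorithm. It assumes a linear utility $x_i^\top \theta$ for each item and an outside "no choice" option with utility zero, neither of which is stated in the abstract. The names `mnl_choice_probs`, `sample_choice`, `features`, and `theta` are hypothetical.

```python
import math
import random


def mnl_choice_probs(features, theta, assortment):
    """Return (probs, p_none) under an MNL model.

    Assumptions (not from the abstract): item i has utility dot(features[i], theta),
    and an outside "no choice" option has utility 0.
    """
    utilities = {
        i: sum(x * w for x, w in zip(features[i], theta)) for i in assortment
    }
    # Shift by the largest utility (including the outside option's 0)
    # so that exponentials cannot overflow.
    shift = max([0.0] + list(utilities.values()))
    weights = {i: math.exp(u - shift) for i, u in utilities.items()}
    outside = math.exp(-shift)
    denom = outside + sum(weights.values())
    probs = {i: w / denom for i, w in weights.items()}
    return probs, outside / denom


def sample_choice(probs, p_none, rng):
    """Simulate bandit feedback: the chosen item index, or None for no choice."""
    r = rng.random()
    cumulative = 0.0
    for item, p in probs.items():
        cumulative += p
        if r < cumulative:
            return item
    return None  # remaining probability mass (p_none) is the outside option


if __name__ == "__main__":
    # Hypothetical data: N = 3 items, d = 2 features, assortment of size K = 2.
    features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    theta = [0.2, -0.1]
    probs, p_none = mnl_choice_probs(features, theta, assortment=[0, 2])
    print(probs, p_none, sample_choice(probs, p_none, random.Random(0)))
```

In the bandit setting, `theta` is unknown and would have to be estimated from the observed choices. An upper confidence bound method would then replace each utility with an optimistic estimate before selecting the assortment.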