We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. Existing algorithms for the MNL contextual bandit provide frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. Since $\kappa$ can be arbitrarily large, e.g., exponentially large in the model parameters, these regret guarantees can be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor affects only the poly-logarithmic term, not the leading term of the regret bound.
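For concreteness, here is a minimal sketch of the standard MNL choice model in this setting; the feature vectors $x_{ti}$ and parameter $\theta^*$ below are illustrative notation, not necessarily the paper's own. When an assortment $S_t \subseteq [N]$ with $|S_t| \le K$ is offered at time $t$, the probability that the user chooses item $i \in S_t$ (with $i = 0$ denoting no purchase) is
\[
p_t(i \mid S_t) \;=\; \frac{\exp\!\big(x_{ti}^\top \theta^*\big)}{1 + \sum_{j \in S_t} \exp\!\big(x_{tj}^\top \theta^*\big)},
\qquad
p_t(0 \mid S_t) \;=\; \frac{1}{1 + \sum_{j \in S_t} \exp\!\big(x_{tj}^\top \theta^*\big)}.
\]
In this literature, $\kappa$ is typically an upper bound on quantities of the form $1/\big(p_t(i \mid S_t)\, p_t(0 \mid S_t)\big)$, which explains why it can grow exponentially with the magnitude of $\theta^*$.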