In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic assortment optimization problem in which, in every round, a decision maker offers a subset (assortment) of products to a consumer and observes the consumer's response. Consumers purchase products so as to maximize their utility. We assume that the products are described by a set of attributes and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model, and consider the decision maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Although this problem has attracted considerable attention recently, many existing methods and their theoretical guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can have an exponential dependence on the number of attributes. In this paper, we propose a new algorithm with a carefully designed exploration strategy and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving on existing methods.
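To make the choice model concrete, the following is a standard contextual MNL formulation consistent with the description above (the attribute vectors $x_i \in \mathbb{R}^d$ and parameter $\theta^* \in \mathbb{R}^d$ are illustrative notation, not fixed by this abstract): when an assortment $S$ is offered, the consumer purchases product $i \in S$, or nothing, with probabilities
\[
  \mathbb{P}(i \mid S) = \frac{e^{x_i^\top \theta^*}}{1 + \sum_{j \in S} e^{x_j^\top \theta^*}},
  \qquad
  \mathbb{P}(\text{no purchase} \mid S) = \frac{1}{1 + \sum_{j \in S} e^{x_j^\top \theta^*}}.
\]
This also suggests why a problem-dependent constant such as $\kappa$ can be so large: in related MNL and logistic bandit analyses, $\kappa$ typically upper-bounds inverse choice-probability products such as $1/\big(\mathbb{P}(i \mid S)\,\mathbb{P}(\text{no purchase} \mid S)\big)$, and since the mean utilities $x_i^\top \theta^*$ are linear in the attribute values, these probabilities can be exponentially small as the magnitude of the utilities grows with the number of attributes $d$.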