In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic assortment optimization problem, where in every round a decision-maker offers a subset (assortment) of products to a consumer and observes the consumer's response. Consumers purchase products to maximize their utility. We assume that each product is described by a set of attributes and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Although this problem has attracted considerable attention recently, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem-dependent constant that can have an exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that its regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favourable regret guarantee.
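For concreteness, the linear-utility MNL assumption described above corresponds to the following standard choice probabilities; the notation here ($x_i$, $\theta^*$, $S$) is illustrative and not fixed by the abstract. A consumer offered assortment $S$ purchases product $i \in S$ with attribute vector $x_i \in \mathbb{R}^d$ with probability
\[
p_i(S, \theta^*) \;=\; \frac{\exp\left(x_i^\top \theta^*\right)}{1 + \sum_{j \in S} \exp\left(x_j^\top \theta^*\right)},
\]
where $\theta^* \in \mathbb{R}^d$ is the unknown parameter vector and the constant $1$ in the denominator corresponds to the no-purchase (outside) option.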