多重逻辑背景土匪的可追踪在线学习算法 (A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit)

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where in every round a decision maker offers a subset (assortment) of products to a consumer, and observes their response. Consumers purchase products so as to maximize their utility. We assume that the products are described by a set of attributes and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior by means of the widely used Multinomial Logit (MNL) model, and consider the decision maker's problem of dynamically learning the model parameters, while optimizing cumulative revenue over the selling horizon $T$. Though this problem has attracted considerable attention in recent times, many existing methods often involve solving an intractable non-convex optimization problem and their theoretical performance guarantees depend on a problem dependent parameter which could be prohibitively large. In particular, existing algorithms for this problem have regret bounded by $O(\sqrt{\kappa d T})$, where $\kappa$ is a problem dependent constant that can have exponential dependency on the number of attributes. In this paper, we propose an optimistic algorithm and show that the regret is bounded by $O(\sqrt{dT} + \kappa)$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step which allows for tractable decision-making while retaining the favourable regret guarantee.

翻译：在本文中, 我们考虑的是MNL- Bandit 问题的背景变体。更具体地说, 我们考虑的是动态的组合优化问题, 每轮决策者都会向消费者提供一组产品( assorment), 并观察他们的反应。消费者购买产品, 以便最大限度地发挥产品效用。我们假设产品由一组属性描述, 产品的平均效用是这些属性值的线性值。我们用广泛使用的多数值逻辑( MNL) 模型来模拟消费者选择行为, 并且考虑决策者的动态学习模型参数的问题, 同时在销售地平线上优化累积收入。尽管这个问题最近引起了相当大的关注, 许多现有方法往往涉及解决棘手的非convex优化问题, 其理论性性性能保障取决于问题依赖的参数, 其范围可能令人望目惊心。特别是, 这一问题的现有算法被$O( sqrtrt) 所约束, $kappappa 是一个持续的问题, 而我们提出一个不断改进的稳度, 也就是, 我们提出一个稳定度的硬度。