Out of the rich family of generalized linear bandits, perhaps the most well-studied are logistic bandits, which are used in problems with binary rewards: for instance, when the learner/agent tries to maximize profit from a user who can select one of two possible outcomes (e.g., `click' vs. `no-click'). Despite remarkable recent progress and improved algorithms for logistic bandits, existing works do not address practical situations where the number of outcomes the user can select is larger than two (e.g., `click', `show me later', `never show again', `no-click'). In this paper, we study such an extension. We use the multinomial logit (MNL) model for the probability of each of the $K+1\geq 2$ possible outcomes (the $+1$ stands for the `no-click' outcome): we assume that, for a learner's action $\mathbf{x}_t$, the user selects one of the $K+1\geq 2$ outcomes, say outcome $i$, according to an MNL probabilistic model with corresponding unknown parameter $\bar{\boldsymbol\theta}_{\ast i}$. Each outcome $i$ is also associated with a revenue parameter $\rho_i$, and the goal is to maximize the expected revenue. For this problem, we present MNL-UCB, an upper confidence bound (UCB)-based algorithm that achieves regret $\tilde{\mathcal{O}}(dK\sqrt{T})$ with small dependence on problem-dependent constants that can otherwise be arbitrarily large and lead to loose regret bounds. We present numerical simulations that corroborate our theoretical results.
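For concreteness, the following is a sketch of the standard MNL choice probabilities and the resulting expected revenue; the indexing convention used here (outcome $0$ denoting `no-click', with $\rho_0 = 0$ by convention) is an illustrative assumption and the paper's exact parameterization may differ:
\[
\mathbb{P}\big(y_t = i \mid \mathbf{x}_t\big) = \frac{\exp\big(\mathbf{x}_t^\top \bar{\boldsymbol\theta}_{\ast i}\big)}{1 + \sum_{j=1}^{K} \exp\big(\mathbf{x}_t^\top \bar{\boldsymbol\theta}_{\ast j}\big)}, \quad i \in \{1,\dots,K\},
\qquad
\mathbb{P}\big(y_t = 0 \mid \mathbf{x}_t\big) = \frac{1}{1 + \sum_{j=1}^{K} \exp\big(\mathbf{x}_t^\top \bar{\boldsymbol\theta}_{\ast j}\big)},
\]
so that the expected revenue of an action $\mathbf{x}_t$ is $\sum_{i=0}^{K} \rho_i\,\mathbb{P}\big(y_t = i \mid \mathbf{x}_t\big)$, which the learner seeks to maximize over rounds $t=1,\dots,T$.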