We consider the regret minimization task in a dueling bandits problem with context information. In every round of this sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (the one with the highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which selects its arms by imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O(\sqrt{dT})$ after $T$ learning rounds. In addition, we establish the optimality of $\texttt{CoLSTIM}$ by proving a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
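For intuition only, the following is a minimal sketch of the perturbed-utility duel selection described above, assuming a Bradley--Terry-style (logit) special case of the CoLST model; the function name, arguments, and Gumbel perturbation are illustrative assumptions, and the confidence-width terms the actual algorithm uses to pick the challenger are omitted.

```python
import numpy as np

def select_duel(theta_hat, features, rng, perturbation_scale=1.0):
    """Illustrative sketch of a perturbed-utility duel selection (hypothetical helper).

    theta_hat: current estimate of the d-dimensional utility parameter.
    features:  (K, d) array with one context-dependent feature vector per arm.
    The first arm maximizes a perturbed utility estimate, imitating the noisy
    feedback process; the challenger here simply maximizes the unperturbed
    estimate among the remaining arms (a simplification of the real rule).
    """
    utilities = features @ theta_hat                      # estimated context-dependent utilities
    noise = perturbation_scale * rng.gumbel(size=len(features))  # Gumbel noise (logit special case)
    first = int(np.argmax(utilities + noise))             # perturbed leader
    remaining = utilities.copy()
    remaining[first] = -np.inf                            # exclude the leader
    second = int(np.argmax(remaining))                    # challenger
    return first, second

# Toy usage with 5 arms and d = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
theta = rng.normal(size=3)
print(select_duel(theta, X, rng))
```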