We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.