A core element in decision-making under uncertainty is feedback on the quality of the performed actions. However, in many applications such feedback is restricted. For example, in recommendation systems, repeatedly asking the user to rate the quality of recommendations will annoy them. In this work, we formalize decision-making problems with a querying budget, in which there is a (possibly time-dependent) hard limit on the number of reward queries allowed. Specifically, we consider multi-armed bandits, linear bandits, and reinforcement learning problems. We start by analyzing the performance of `greedy' algorithms that query a reward whenever they can. We show that in fully stochastic settings, doing so performs surprisingly well, but that in the presence of any adversity it might lead to linear regret. To overcome this issue, we propose the Confidence-Budget Matching (CBM) principle, which queries rewards when the confidence intervals are wider than the inverse square root of the available budget. We analyze the performance of CBM-based algorithms in different settings and show that they perform well in the presence of adversity in the contexts, initial states, and budgets.
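To make the querying rule concrete, the following is a minimal sketch (not the paper's exact algorithm) of how the CBM principle could be wired into a UCB-style multi-armed bandit: a reward query is spent only when the chosen arm's confidence radius exceeds the inverse square root of the remaining budget. The Hoeffding-style radius, the Bernoulli reward simulation, and all names (`cbm_ucb`, `arm_means`, `budget`) are our own assumptions for illustration.

```python
import numpy as np

def cbm_ucb(arm_means, horizon, budget, delta=0.05, rng=None):
    """Sketch of the Confidence-Budget Matching querying rule in a
    UCB-style bandit. `arm_means` are hypothetical Bernoulli means
    used only to simulate rewards."""
    rng = rng or np.random.default_rng(0)
    K = len(arm_means)
    counts = np.zeros(K)   # number of *observed* rewards per arm
    sums = np.zeros(K)     # sum of observed rewards per arm
    remaining = budget

    for t in range(horizon):
        # Hoeffding-style confidence radius; infinite for unobserved arms.
        radius = np.where(
            counts > 0,
            np.sqrt(np.log(2 * K * horizon / delta) / np.maximum(counts, 1)),
            np.inf,
        )
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 1.0)
        arm = int(np.argmax(means + radius))  # optimistic arm choice

        # CBM rule: query only if the uncertainty about the chosen arm
        # is wider than the inverse square root of the remaining budget.
        if remaining > 0 and radius[arm] > 1.0 / np.sqrt(remaining):
            reward = rng.binomial(1, arm_means[arm])
            counts[arm] += 1
            sums[arm] += reward
            remaining -= 1
        # Otherwise the arm is pulled but no feedback is requested.
```

A greedy variant would simply query whenever `remaining > 0`; the CBM condition instead spreads the budget over rounds where the estimate is still uncertain.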