Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm is the inner product of an unknown parameter and the context vector of that arm. Classical bandit settings rely heavily on the assumption that the contexts are fully observed, while the richer model of imperfectly observed contextual bandits remains understudied. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically in the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing this efficiency of Greedy policies is also provided.
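To make the certainty-equivalence idea behind the Greedy policy concrete, the following is a minimal simulation sketch. It assumes an illustrative model that is not necessarily the paper's exact setup: zero-mean Gaussian contexts observed with additive Gaussian noise (with known covariances), a linear reward with Gaussian noise, context estimates given by the posterior mean of the true context given its noisy observation, and a ridge-style estimate of the unknown parameter; the policy then acts as if these estimates were the true values.

```python
# A minimal sketch of a certainty-equivalence (Greedy) policy for a linear
# contextual bandit with imperfectly observed contexts. The Gaussian context
# and observation models, the known covariances, and the ridge-style parameter
# estimate are illustrative assumptions, not the paper's exact specification.
import numpy as np

rng = np.random.default_rng(0)

d, K, T = 5, 10, 2000           # context dimension, number of arms, horizon
mu_star = rng.normal(size=d)    # unknown reward parameter
Sigma_x = np.eye(d)             # context covariance (assumed known here)
Sigma_o = 0.5 * np.eye(d)       # observation-noise covariance (assumed known)
sigma_r = 0.5                   # reward-noise standard deviation

# Posterior-mean map E[x | y] = Sigma_x (Sigma_x + Sigma_o)^{-1} y for
# zero-mean Gaussian contexts observed with additive Gaussian noise.
ctx_filter = Sigma_x @ np.linalg.inv(Sigma_x + Sigma_o)

B = np.eye(d)                   # regularized Gram matrix of estimated contexts
f = np.zeros(d)                 # accumulated (estimated context) * reward
regret = 0.0

for t in range(T):
    X = rng.multivariate_normal(np.zeros(d), Sigma_x, size=K)      # true contexts
    Y = X + rng.multivariate_normal(np.zeros(d), Sigma_o, size=K)  # noisy observations
    X_hat = Y @ ctx_filter.T            # plug-in estimates of the unobserved contexts

    mu_hat = np.linalg.solve(B, f)      # current estimate of the unknown parameter
    a = int(np.argmax(X_hat @ mu_hat))  # Greedy: act as if estimates are exact

    r = X[a] @ mu_star + sigma_r * rng.normal()   # reward of the chosen arm
    B += np.outer(X_hat[a], X_hat[a])             # update estimator statistics
    f += r * X_hat[a]

    regret += np.max(X @ mu_star) - X[a] @ mu_star  # regret vs. the best arm

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Under these assumptions, the cumulative regret printed at the end should grow slowly relative to the horizon, consistent with the poly-logarithmic regret behavior described above.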