In this paper we initiate a study of nonparametric contextual bandits under shape constraints on the mean reward function. Specifically, we study a setting where the context is one-dimensional and the mean reward function is isotonic with respect to this context. We propose a policy for this problem and show that it attains minimax rate-optimal regret. Moreover, we show that the same policy enjoys automatic adaptation: for subclasses of the parameter space where the true mean reward function is also piecewise constant with $k$ pieces, this policy remains minimax rate optimal simultaneously for all $k \geq 1$. Automatic adaptation phenomena are well known for shape-constrained problems in the offline setting; we show that such phenomena carry over to the online setting. The main technical ingredient underlying our policy is a procedure to derive confidence bands for an underlying isotonic function using the isotonic quantile estimator. The confidence band we propose is valid under heavy-tailed noise, and its average width goes to $0$ at an adaptively optimal rate. We consider this to be an independent contribution to the isotonic regression literature.