We study a non-parametric multi-armed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. Previous studies have focused on deriving minimax-optimal algorithms in cases where it is a priori known how smooth the payoff functions are. In practice, however, the smoothness of payoff functions is typically not known in advance, and misspecification of smoothness may severely deteriorate the performance of existing methods. In this work, we consider a framework where the smoothness of payoff functions is not known, and study when and how algorithms may adapt to unknown smoothness. First, we establish that designing algorithms that adapt to unknown smoothness of payoff functions is, in general, impossible. However, under a self-similarity condition (which does not reduce the minimax complexity of the dynamic optimization problem at hand), we establish that adapting to unknown smoothness is possible, and further devise a general policy for achieving smoothness-adaptive performance. Our policy infers the smoothness of payoffs throughout the decision-making process, while leveraging the structure of non-adaptive off-the-shelf policies. We establish that for problem settings with either differentiable or non-differentiable payoff functions this policy matches (up to a logarithmic scale) the regret rate that is achievable when the smoothness of payoffs is known a priori.
翻译:我们研究的是非参数性多臂匪盗问题, 其特点是, 关键的复杂性驱动因素是, 支付功能对共变功能的顺利性。 先前的研究侧重于在事先知道支付功能如何顺利的情况下, 得出最优的小型算法。 然而, 在实际中, 付款功能的顺利性通常事先并不为人所知, 光滑性的具体性可能会严重地恶化现有方法的性能。 在这项工作中, 我们考虑的是这样一个框架: 支付功能的顺利性不为人知, 并且研究算法何时和如何适应未知的平稳性。 首先, 我们确定, 设计算法, 适应支付功能的不为人知的顺利性, 一般来说是不可能的。 然而, 在自我相似性条件下( 这不会降低当前动态优化问题的小巧复杂性), 我们确定, 适应未知的平滑性可能严重恶化现有方法的性能。 在这项工作中, 我们的政策偏向于整个决策过程的平稳性偿付能力, 以及如何适应未知的平稳性平滑性。 我们确定, 之前的平滑性的政策是, 当我们决定性决策的平时, 调整性调整性的政策, 之前的平时, 我们的平时, 将不调整性决策性调整性调整性调整性 。