We study a non-parametric multi-armed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. Previous studies have focused on deriving minimax-optimal algorithms in cases where it is a priori known how smooth the payoff functions are. In practice, however, the smoothness of payoff functions is typically not known in advance, and misspecification of smoothness may severely deteriorate the performance of existing methods. In this work, we consider a framework where the smoothness of payoff functions is not known, and study when and how algorithms may adapt to unknown smoothness. First, we establish that designing algorithms that adapt to unknown smoothness of payoff functions is, in general, impossible. However, under a self-similarity condition (which does not reduce the minimax complexity of the dynamic optimization problem at hand), we establish that adapting to unknown smoothness is possible, and further devise a general policy for achieving smoothness-adaptive performance. Our policy infers the smoothness of payoffs throughout the decision-making process, while leveraging the structure of off-the-shelf non-adaptive policies. We establish that for problem settings with either differentiable or non-differentiable payoff functions, this policy matches (up to a logarithmic scale) the regret rate that is achievable when the smoothness of payoffs is known a priori.
翻译:我们研究的是非参数性多臂匪盗问题, 其特点是, 关键的复杂性驱动因素是, 支付功能对共变功能的顺利性。 先前的研究侧重于在事先知道支付功能的顺利性的情况下, 得出最优的迷你算法。 然而, 在实践上, 付款功能的顺利性通常事先并不为人所知, 光滑性的具体性可能会严重恶化现有方法的性能。 在这项工作中, 我们考虑的是, 一个关键的复杂性驱动因素是, 支付功能的顺利性不为人知, 并研究算法何时和如何适应未知的平稳性。 首先, 我们确定, 设计算法, 适应支付功能的不为人知的顺利性, 一般来说是不可能的。 然而, 在一个自相类似的条件下( 它不会降低当前动态优化问题的最低复杂性), 我们确定, 适应未知的顺畅顺差性可能严重地恶化现有方法的绩效, 进一步制定总体政策。 我们的政策将整个决策过程的平稳性偿付能力比得平滑, 而我们确定之前的平滑性的政策则是, 使先前的支付政策在不易的汇率上形成一个可实现的逻辑。