Bandit problems with linear or concave reward have been extensively studied, but relatively few works have studied bandits with non-concave reward. This work considers a large family of bandit problems in which the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problem and the two-layer neural network bandit problem with polynomial activation. For the low-rank generalized linear bandit problem, we provide an algorithm that is minimax-optimal in the dimension, refuting both conjectures in [LMT21, JWWN19]. Our algorithms are based on a unified zeroth-order optimization paradigm that applies in great generality and attains optimal rates (in the dimension) in several structured polynomial settings. We further demonstrate the applicability of our algorithms to RL in the generative model setting, obtaining improved sample complexity over prior approaches. Finally, we show that standard optimistic algorithms (e.g., UCB) are sub-optimal by dimension factors. In the neural network setting (with polynomial activation functions) with noiseless reward, we provide a bandit algorithm whose sample complexity equals the intrinsic algebraic dimension. Again, we show that optimistic approaches have worse sample complexity, polynomial in the extrinsic dimension (which could be exponentially worse in the polynomial degree).
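To make the term concrete, the following is a minimal sketch of the standard two-point zeroth-order gradient estimate; it is a generic illustration of zeroth-order optimization, not necessarily the exact estimator used in our algorithms. For a reward function $f : \mathbb{R}^d \to \mathbb{R}$, a query radius $\delta > 0$, and a uniformly random unit vector $u \in \mathbb{R}^d$,
$$
\widehat{\nabla} f(x) \;=\; \frac{d}{2\delta}\,\bigl(f(x+\delta u) - f(x-\delta u)\bigr)\,u ,
$$
which uses only function (reward) evaluations, i.e., zeroth-order information, and is an unbiased estimate of the gradient of a smoothed version of $f$.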