In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-$\textit{PAC}$ setting: given a policy class $\Pi$, the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
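As a sketch of the success criterion (writing $V(\pi)$ for the expected reward of a policy $\pi$ and $\hat{\pi}$ for the policy returned by the learner; both symbols are introduced here purely for illustration and are not notation fixed above), the $(\epsilon,\delta)$-PAC requirement can be stated as
\[
\Pr\!\left[\, V(\pi^\star) - V(\hat{\pi}) \le \epsilon \,\right] \ge 1 - \delta,
\qquad \text{where } \pi^\star \in \operatorname*{arg\,max}_{\pi \in \Pi} V(\pi).
\]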