Classification bandits are multi-armed bandit problems in which the task is to classify a given set of arms as positive or negative according to whether the rate of arms with expected reward at least h is at least w, for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space and the expected rewards f(x) are generated according to a Gaussian process prior. We develop a framework algorithm for this problem that accommodates various arm selection policies, and we propose two policies called FCB and FTSV. We prove a sample complexity upper bound for FCB that is smaller than the bound for the existing level set estimation algorithm, which must decide whether f(x) is at least h for every arm x. We also propose arm selection policies that depend on an estimated rate of arms with rewards of at least h, and show that they improve empirical sample complexity. In our experiments, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with maximum variance, outperform the other policies on synthetic functions, and the rate-estimation version of FTSV is also the best performer on our real-world dataset.
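To make the problem setup concrete, the following is a minimal sketch, not the paper's algorithm: it maintains a Gaussian process posterior over a finite arm grid, pulls arms with the maximum-variance policy mentioned above, and stops as soon as the rate question is settled in either direction. The 1-D grid, RBF kernel, noise level, thresholds, confidence width, and the latent function are all illustrative assumptions; the FCB and FTSV policies themselves are not reproduced here.

```python
"""Minimal classification bandit sketch over a GP prior (illustrative only)."""
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Arm set: K points x in d-dimensional space (d = 1 here for brevity).
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
f_true = np.sin(6.0 * X).ravel()       # hypothetical latent reward f(x)
noise = 0.1                            # observation noise std (assumed)
h, w = 0.0, 0.4                        # thresholds h and w (arbitrary values)
K = len(X)
need = int(np.ceil(w * K))             # positive arms needed for the positive class

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=noise**2)
X_obs, y_obs = [], []

decision = None
for t in range(200):
    if X_obs:
        gp.fit(np.vstack(X_obs), np.array(y_obs))
        mu, sigma = gp.predict(X, return_std=True)
    else:
        mu, sigma = np.zeros(K), np.ones(K)   # GP prior before any pulls

    lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma
    n_sure = int(np.sum(lower >= h))          # arms certainly at or above h
    n_possible = int(np.sum(upper >= h))      # arms possibly at or above h

    # Classify the whole arm set as soon as the rate question is settled,
    # without deciding every individual arm (unlike level set estimation).
    if n_sure >= need:
        decision = "positive"
        break
    if n_possible < need:
        decision = "negative"
        break

    # Maximum-variance policy: pull the most uncertain arm.
    i = int(np.argmax(sigma))
    X_obs.append(X[i])
    y_obs.append(f_true[i] + noise * rng.standard_normal())

print(f"decision = {decision} after {len(y_obs)} pulls")
```

Note that the stopping condition only needs enough certainly-positive arms, or too few possibly-positive arms, to resolve the rate comparison with w; this is a strictly weaker requirement than deciding whether f(x) is at least h for every arm x, which is the intuition behind the smaller sample complexity bound claimed above.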