In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution, and each arm can be sampled repeatedly to obtain noisy estimates of its average reward. Prior work focuses on identifying the best arm, i.e., estimating the maximum of the average reward distribution. We consider a general class of distribution functionals beyond the maximum, and propose unified meta-algorithms for both the offline and online settings, achieving optimal sample complexities. We show that online estimation, where the learner can sequentially choose whether to sample a new or existing arm, offers no advantage over the offline setting for estimating the mean functional, but significantly reduces the sample complexity for other functionals such as the median, maximum, and trimmed mean. The matching lower bounds utilize several different Wasserstein distances. For the special case of median estimation, we identify a curious thresholding phenomenon in the indistinguishability of Gaussian convolutions as the noise level varies, which may be of independent interest.
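For concreteness, the sampling model described above admits the following formalization; the notation here is an illustrative choice, not taken verbatim from the paper. Arm means are drawn i.i.d. from an unknown reward distribution $F$, and each pull of an arm returns its mean corrupted by independent noise (Gaussian noise of level $\sigma$ is the canonical assumption, consistent with the Gaussian convolutions mentioned above):
\[
\theta_1, \theta_2, \dots \overset{\text{i.i.d.}}{\sim} F, \qquad X_{i,j} = \theta_i + \varepsilon_{i,j}, \quad \varepsilon_{i,j} \sim \mathcal{N}(0, \sigma^2),
\]
where $X_{i,j}$ denotes the $j$-th pull of arm $i$. The goal is to estimate a functional $T(F)$, e.g., the mean $\mathbb{E}_{\theta \sim F}[\theta]$, the median of $F$, or its maximum (essential supremum), to within a target accuracy using as few total pulls as possible. In the offline setting, the number of arms and the number of pulls per arm are fixed in advance; in the online setting, the learner decides after each observation whether to pull an existing arm again or draw a fresh arm from $F$.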