We consider a multi-armed bandit problem motivated by situations where only the extreme values are of interest, as opposed to the expected values targeted in the classical bandit setting. We propose distribution-free algorithms based on robust statistics and characterize their statistical properties. We show that the proposed algorithms achieve vanishing extremal regret under weaker conditions than existing algorithms. Numerical experiments demonstrate the performance of the algorithms in the finite-sample setting, and the results show that the proposed algorithms outperform well-known algorithms.