We present blind exploration and exploitation (BEE) algorithms for identifying the most reliable stochastic expert. The algorithms build on formulations of the stochastic multi-armed bandit problem that employ posterior sampling, upper confidence bounds, empirical Kullback-Leibler divergence, and minimax methods. Jointly sampling and consulting experts whose opinions depend on the hidden, random state of the world is challenging in the unsupervised, or blind, framework because feedback from the true state is not available. We propose an empirically realizable measure of expert competence that can be inferred instantaneously using only the opinions of other experts. This measure preserves the ordering of the true competences and thus enables joint sampling and consultation of stochastic experts based on their opinions on dynamically changing tasks. Statistics derived from the proposed measure are instantaneously available, allowing both blind exploration-exploitation and unsupervised opinion aggregation. We discuss how the lack of supervision affects the asymptotic regret of BEE architectures that rely on UCB1, KL-UCB, MOSS, IMED, and Thompson sampling. We demonstrate the performance of different BEE algorithms empirically and compare them to their standard, or supervised, counterparts.
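To make the blind setting concrete, the following is a minimal sketch of a UCB1-style loop driven by a peer-agreement proxy instead of supervised feedback. The abstract does not define the competence measure, so everything specific here is an assumption made for illustration: the proxy (agreement between the chosen expert and one uniformly random peer), the binary hidden state, the conditional independence of experts given the state, and the names `blind_ucb1`, `ucb1_index`, and `competences`. It is not the authors' measure or algorithm, and the proxy can be expected to preserve the competence ordering only under favourable conditions, such as all experts being better than chance.

```python
import math
import random


def ucb1_index(mean, pulls, t):
    """Standard UCB1 index: empirical mean plus an exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def blind_ucb1(competences, horizon, seed=0):
    """Run UCB1 on a peer-agreement proxy; the learner never sees the true state.

    competences: hypothetical probabilities that each expert reports the
                 true binary state (used only to simulate opinions).
    Returns (pull counts, empirical agreement rate per expert).
    """
    rng = random.Random(seed)
    n = len(competences)  # assumes at least two experts
    pulls = [0] * n
    agree = [0.0] * n  # running mean of the proxy reward for each expert

    for t in range(1, horizon + 1):
        if t <= n:
            i = t - 1  # consult every expert once to initialise
        else:
            i = max(range(n), key=lambda k: ucb1_index(agree[k], pulls[k], t))

        # Hidden state of the world: drawn by the simulator, hidden from the learner.
        state = rng.random() < 0.5

        # Assumed proxy: compare expert i with one uniformly random peer j.
        j = rng.choice([k for k in range(n) if k != i])
        opinion_i = state if rng.random() < competences[i] else not state
        opinion_j = state if rng.random() < competences[j] else not state
        reward = 1.0 if opinion_i == opinion_j else 0.0

        pulls[i] += 1
        agree[i] += (reward - agree[i]) / pulls[i]  # incremental mean update

    return pulls, agree
```

Calling `blind_ucb1([0.9, 0.7, 0.6], horizon=5000)` returns how often each expert was consulted and its empirical agreement rate. Replacing the proxy reward with a comparison of `opinion_i` against `state` gives the supervised counterpart, which is the kind of baseline the abstract compares against.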