We study an extension of the standard bandit problem in which there are R layers of experts. The multi-layered experts make selections layer by layer, and only the experts in the last layer can pull arms. The goal of the learning policy is to minimize the total regret in this hierarchical experts setting. We first analyze a case in which the total regret grows linearly with the number of layers. We then focus on the case in which all experts play the Upper Confidence Bound (UCB) strategy and give several sub-linear upper bounds for different circumstances. Finally, we design experiments that support the regret analysis for the general hierarchical UCB structure and demonstrate the practical significance of our theoretical results. This article offers many insights into the design of reasonable hierarchical decision structures.
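As a point of reference for the strategy the experts play, a single agent running the standard UCB1 rule can be sketched as follows. This is an illustrative sketch only, assuming a two-armed Bernoulli bandit; the function name and simulation setup are ours, not the paper's hierarchical construction:

```python
import math
import random

def ucb1_select(counts, rewards, t):
    """Pick the arm maximizing the UCB1 index: empirical mean + sqrt(2 ln t / n)."""
    for a in range(len(counts)):
        if counts[a] == 0:
            return a  # play each arm once before using the index
    return max(range(len(counts)),
               key=lambda a: rewards[a] / counts[a]
                             + math.sqrt(2 * math.log(t) / counts[a]))

# Simulate a 2-armed Bernoulli bandit (hypothetical means, for illustration).
random.seed(0)
means = [0.3, 0.7]
counts = [0, 0]
rewards = [0.0, 0.0]
for t in range(1, 2001):
    a = ucb1_select(counts, rewards, t)
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    rewards[a] += r
```

After enough rounds, the better arm accumulates most of the pulls, which is the behavior whose hierarchical analogue the paper's regret bounds quantify.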