We study the problem of nonstochastic bandits with expert advice, extending the setting from finitely many experts to any countably infinite set: a learner aims to maximize the total reward by taking actions sequentially based on bandit feedback while benchmarking against a set of experts. We propose a variant of Exp4.P that, for finitely many experts, enables inference of the correct expert ranking while preserving the order of the regret upper bound. We then incorporate this variant into a meta-algorithm that handles infinitely many experts. We prove a high-probability upper bound of $\tilde{\mathcal{O}} \big( i^*K + \sqrt{KT} \big)$ on the regret, up to polylog factors, where $i^*$ is the unknown position of the best expert, $K$ is the number of actions, and $T$ is the time horizon. We also provide an example of structured experts and discuss how to expedite learning in such a case. Our meta-algorithm achieves optimal regret up to polylog factors when $i^* = \tilde{\mathcal{O}} \big( \sqrt{T/K} \big)$. If a prior distribution over $i^*$ is assumed to exist, the probability of optimality increases with $T$, potentially at a fast rate.