We study the problem of nonstochastic bandits with infinitely many experts: a learner aims to maximize the total reward by taking actions sequentially based on bandit feedback while benchmarking against a countably infinite set of experts. We propose a variant of Exp4.P that, for finitely many experts, enables inference of correct expert rankings while preserving the order of the regret upper bound. We then incorporate this variant into a meta-algorithm that handles infinitely many experts. We prove a high-probability regret upper bound of $\tilde{\mathcal{O}} \big( i^*K + \sqrt{KT} \big)$, up to polylog factors, where $i^*$ is the unknown position of the best expert, $K$ is the number of actions, and $T$ is the time horizon. We also provide an example of structured experts and discuss how to expedite learning in such a case. Our meta-learning algorithm achieves the tightest regret upper bound for the setting considered when $i^* = \tilde{\mathcal{O}} \big( \sqrt{T/K} \big)$. If a prior distribution is assumed to exist for $i^*$, the probability of satisfying a tight regret bound increases with $T$, and this rate of increase can be fast.