We investigate the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family of size $K$ under limited access to information. We assume that in each round the learner predicts using a convex combination of at most $p$ experts and then observes, a posteriori, the losses of at most $m$ experts. We assume that the loss function is range-bounded and exp-concave. In the standard multi-armed bandit setting, where the learner is allowed to play only one expert per round and observe only its feedback, the known optimal regret bounds are of order $O(\sqrt{KT})$. We show that allowing the learner to play one additional expert per round and observe one additional feedback substantially improves the regret guarantees. We provide a strategy that combines only $p = 2$ experts per round for prediction and observes the losses of $m \ge 2$ experts. Its randomized regret (with respect to the internal randomization of the learner's strategy) is of order $O\big((K/m)\log(K\delta^{-1})\big)$ with probability $1 - \delta$, i.e., it is independent of the horizon $T$ ("constant" or "fast rate" regret), if $p \ge 2$ and $m \ge 3$. We prove that this rate is optimal up to a logarithmic factor in $K$. In the case $p = m = 2$, we provide an upper bound of order $O\big(K^2 \log(K\delta^{-1})\big)$, with probability $1 - \delta$. Our strategies require no prior knowledge of the horizon $T$ or of the confidence parameter $\delta$. Finally, we show that if the learner is constrained to observe only one expert's feedback per round, the worst-case regret is the "slow rate" $\Omega(\sqrt{KT})$, suggesting that synchronous observation of at least two experts per round is necessary to obtain constant regret.
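For concreteness, the quantities discussed above can be written out as follows. This is only a sketch: the symbols $F_{i,t}$ (prediction of expert $i$ at round $t$), $y_t$ (outcome), $S_t$ (set of experts combined at round $t$), $w_{i,t}$ (combination weights), $\ell$ (loss), and $\eta$ (exp-concavity parameter) are notation assumed here for illustration and do not appear in the text above. The exp-concavity condition shown is the standard definition.

```latex
% Notation below is assumed for illustration; it is not taken from the abstract.
% F_{i,t}: prediction of expert i at round t;  y_t: outcome;  \ell: loss function.

% Prediction at round t: a convex combination of at most p experts.
\[
  \hat{y}_t = \sum_{i \in S_t} w_{i,t}\, F_{i,t},
  \qquad |S_t| \le p, \quad w_{i,t} \ge 0, \quad \sum_{i \in S_t} w_{i,t} = 1 .
\]

% Cumulative regret with respect to the best of the K experts after T rounds.
\[
  R_T = \sum_{t=1}^{T} \ell(\hat{y}_t, y_t)
        - \min_{i \in \{1,\dots,K\}} \sum_{t=1}^{T} \ell(F_{i,t}, y_t) .
\]

% Exp-concavity (standard definition): for some \eta > 0 and every outcome y,
\[
  x \mapsto \exp\bigl(-\eta\, \ell(x, y)\bigr) \quad \text{is concave.}
\]
```

In this notation, a "fast rate" bound controls $R_T$ by a quantity that does not grow with $T$, whereas a "slow rate" bound grows as $\sqrt{KT}$.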