This study considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithms, whose regret is favorably bounded in both the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is $O(m^2 k^4 \log(T) \log(k_{\Pi} T) / \Delta_{\min})$ in the stochastic regime and $O(m k^{2/3} \sqrt{T \log(T) \log k_{\Pi}})$ in the adversarial regime, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $\Delta_{\min}$ is the minimum suboptimality gap, and $k_{\Pi}$ is the number of Pareto-optimal actions. Moreover, we show that for globally observable games, the regret is $O(c_{\mathcal{G}}^2 \log(T) \log(k_{\Pi} T) / \Delta_{\min}^2)$ in the stochastic regime and $O((c_{\mathcal{G}}^2 \log(T) \log(k_{\Pi} T))^{1/3} T^{2/3})$ in the adversarial regime, where $c_{\mathcal{G}}$ is a game-dependent constant. We also provide regret bounds for the stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the exploration-by-optimization approach and by adaptive learning rates developed for online learning with feedback graphs.
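To make the algorithmic skeleton concrete, below is a minimal sketch of a follow-the-regularized-leader (FTRL) loop with a negative-entropy regularizer and a simple data-dependent learning rate. It is an illustration only: the feedback model, the importance-weighted loss estimator, and the learning-rate schedule here are generic assumptions, not the paper's actual algorithm, which additionally performs exploration by optimization over the game's observation structure and uses a game-specific adaptive rate.

```python
# Minimal FTRL sketch with an adaptive learning rate (illustrative only;
# the paper's algorithm differs in the estimator and the rate schedule).
import numpy as np

rng = np.random.default_rng(0)

k, T = 5, 1000                 # number of actions, number of rounds
cum_loss_est = np.zeros(k)     # cumulative loss estimates fed to FTRL
penalty = 0.0                  # accumulated second-moment term driving the rate

for t in range(1, T + 1):
    # Adaptive learning rate: shrinks as the accumulated stability terms grow
    # (a common generic schedule; the paper derives a game-specific one).
    eta = np.sqrt(np.log(k) / (1.0 + penalty))

    # FTRL with negative Shannon entropy has the closed-form softmax solution
    # of  argmin_p  <p, cum_loss_est> + (1/eta) * sum_i p_i log p_i.
    logits = -eta * cum_loss_est
    p = np.exp(logits - logits.max())
    p /= p.sum()

    a = rng.choice(k, p=p)     # play an action sampled from p

    # Placeholder feedback: in partial monitoring the learner only observes a
    # signal, from which an unbiased loss estimate must be constructed; here we
    # fake bandit-style feedback purely for illustration.
    loss = rng.uniform(size=k)
    loss_est = np.zeros(k)
    loss_est[a] = loss[a] / p[a]          # importance-weighted estimate

    cum_loss_est += loss_est
    penalty += float(p @ (loss_est ** 2))  # second-moment stability proxy
```

The closed-form softmax step is the standard solution of entropy-regularized FTRL; replacing the estimator and the schedule with game-dependent choices is where the paper's analysis does its work.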