This paper considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithms, whose regret is bounded poly-logarithmically in the stochastic regime and near-optimally in the adversarial regime. More specifically, we show that for non-degenerate locally observable games, the regret in the stochastic regime is bounded by $O(k^3 m^2 \log(T) \log(k_{\Pi} T) / \Delta_{\min})$ and in the adversarial regime by $O(k^{2/3} m \sqrt{T \log(T) \log k_{\Pi}})$, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $\Delta_{\min}$ is the minimum optimality gap, and $k_{\Pi}$ is the number of Pareto optimal actions. Moreover, we show that for non-degenerate globally observable games, the regret in the stochastic regime is bounded by $O(\max\{c_{\mathcal{G}}^2 / k,\, c_{\mathcal{G}}\} \log(T) \log(k_{\Pi} T) / \Delta_{\min}^2)$ and in the adversarial regime by $O((\max\{c_{\mathcal{G}}^2 / k,\, c_{\mathcal{G}}\} \log(T) \log(k_{\Pi} T))^{1/3} T^{2/3})$, where $c_{\mathcal{G}}$ is a game-dependent constant. Our algorithms are based on the follow-the-regularized-leader framework, adapted to the nature of the partial monitoring problem and inspired by algorithms from the field of online learning with feedback graphs.
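The algorithms above build on the follow-the-regularized-leader (FTRL) framework. As a generic illustration only (not the paper's actual partial-monitoring algorithm, which must form loss estimates from limited feedback), here is a minimal full-information FTRL sketch with a negative-entropy regularizer, which reduces to the exponential-weights closed form; the learning rate `eta` and the toy loss sequence are assumptions for the example.

```python
import math

def ftrl_neg_entropy(loss_sequence, eta=0.5):
    """Full-information FTRL with a negative-entropy regularizer.

    Each round, play q_t = argmin_q <L_{t-1}, q> + (1/eta) * neg_entropy(q),
    whose closed form is exponential weights: q_t(i) ∝ exp(-eta * L_{t-1}(i)).
    loss_sequence: list of per-round loss vectors (length-k lists).
    Returns the algorithm's total expected loss.
    """
    k = len(loss_sequence[0])
    cum = [0.0] * k          # cumulative losses L_{t-1}
    total = 0.0
    for losses in loss_sequence:
        m = min(cum)         # subtract the min for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum]
        s = sum(w)
        p = [x / s for x in w]
        total += sum(pi * li for pi, li in zip(p, losses))
        for i in range(k):
            cum[i] += losses[i]
    return total
```

On a sequence where action 0 always incurs loss 0 and action 1 always incurs loss 1, the weights concentrate on action 0 geometrically fast, so total loss stays far below the uniform-play baseline of $T/2$.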