We consider a distributed reinforcement learning setting in which multiple agents explore the environment separately and communicate their experiences through a central server. However, an $\alpha$-fraction of the agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude, and their fake data can be of arbitrary size. We aim to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem that can handle batches of arbitrary sizes. Building on this new estimator, we design a Byzantine-robust distributed pessimistic value iteration algorithm for the offline setting and a Byzantine-robust distributed optimistic value iteration algorithm for the online setting. Both algorithms attain near-optimal sample complexities and achieve stronger robustness guarantees than prior work.
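To make the estimation problem concrete, the following is a minimal, hypothetical Python sketch of the simplest baseline for robust mean estimation from batches: the unweighted median of per-batch means. This is not the paper's Weighted-Clique algorithm; the function name `naive_robust_batch_mean` and all constants are illustrative assumptions, and the sketch only shows the setting (batches of arbitrary sizes, an $\alpha$-fraction of which are adversarial) and why exploiting batch sizes without losing robustness is the hard part.

```python
import numpy as np

def naive_robust_batch_mean(batches):
    """Naive baseline: the (unweighted) median of per-batch means.

    Each batch is one agent's reported samples; an alpha-fraction of the batches may
    be arbitrary (colluding adversaries, batches of any size). As long as alpha < 1/2,
    the median of batch means cannot be dragged arbitrarily far, but it ignores batch
    sizes and therefore wastes samples -- the inefficiency a size-aware estimator such
    as Weighted-Clique is designed to remove.
    """
    return float(np.median([np.mean(batch) for batch in batches]))

# Toy usage: 9 honest batches drawn from N(1, 1) with varying sizes, plus one
# adversarial batch that is both huge and far from the truth (alpha = 0.1 here).
rng = np.random.default_rng(0)
honest = [rng.normal(loc=1.0, scale=1.0, size=rng.integers(5, 50)) for _ in range(9)]
fake = [np.full(1000, 100.0)]
print(naive_robust_batch_mean(honest + fake))  # stays near 1.0 despite the fake batch
```

The naive estimator is robust regardless of how large the adversarial batches are, but its accuracy is limited by the number of batches rather than the total number of honest samples, which is the gap a batch-size-aware robust estimator aims to close.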