We study a novel setting in offline reinforcement learning (RL) in which a number of distributed machines jointly cooperate to solve the problem, but only a single round of communication is allowed and there is a budget constraint on the total amount of information (in bits) that each machine can send out. For value function prediction in contextual bandits, and in both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk for distributed statistical estimators; this reveals the minimum amount of communication required by any offline RL algorithm. Specifically, for contextual bandits, we show that the number of bits must scale at least as $\Omega(AC)$ to match the centralised minimax optimal rate, where $A$ is the number of actions and $C$ is the context dimension; we obtain similar results in the MDP settings. Furthermore, we develop learning algorithms based on least-squares estimates and Monte-Carlo return estimates, and provide a sharp analysis showing that they achieve optimal risk up to logarithmic factors. We also show that temporal-difference learning is unable to efficiently utilise information from all available devices under the single-round communication setting, owing to the initial bias of this method. To the best of our knowledge, this paper presents the first minimax lower bounds for distributed offline RL problems.
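The abstract describes the single-round protocol only at a high level. Below is a minimal illustrative sketch, not the paper's algorithm, of single-round distributed value estimation for contextual bandits: each machine forms a local per-action least-squares estimate of size $O(AC)$, sends it once, and the server averages the received estimates. The dimensions, the ridge regulariser, and the uniform behaviour policy are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only): A actions, C-dimensional contexts,
# m machines, n offline samples per machine.
A, C, m, n = 5, 8, 10, 200
theta_true = rng.normal(size=(A, C))        # per-action value parameters
target_policy = rng.dirichlet(np.ones(A))   # evaluation policy over actions

def local_least_squares(seed):
    """One machine: offline data -> ridge least-squares estimate per action."""
    g = np.random.default_rng(seed)
    contexts = g.normal(size=(n, C))
    actions = g.integers(0, A, size=n)      # assumed uniform behaviour policy
    rewards = np.einsum('ij,ij->i', contexts, theta_true[actions]) \
        + g.normal(scale=0.1, size=n)
    theta_hat = np.zeros((A, C))
    for a in range(A):
        X, y = contexts[actions == a], rewards[actions == a]
        theta_hat[a] = np.linalg.solve(X.T @ X + 1e-3 * np.eye(C), X.T @ y)
    return theta_hat

# Single round of communication: each machine transmits its O(A*C) numbers
# once; the server averages them with no further interaction.
messages = [local_least_squares(s) for s in range(m)]
theta_server = np.mean(messages, axis=0)

# Predict the target policy's value on a fresh context with the pooled estimate.
x = rng.normal(size=C)
v_hat = target_policy @ (theta_server @ x)
v_true = target_policy @ (theta_true @ x)
print(f"estimated value {v_hat:.3f} vs true value {v_true:.3f}")
```

In this sketch the messages are real-valued; the bit-budget constraint discussed in the abstract would additionally require quantising each local estimate before transmission.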