We show that in a cooperative $N$-agent network, one can design locally executable policies for the agents such that the resulting discounted sum of average rewards (value) well approximates the optimal value computed over all (including non-local) policies. Specifically, we prove that, if $|\mathcal{X}|$ and $|\mathcal{U}|$ denote the sizes of the state and action spaces of an individual agent, then for a sufficiently small discount factor, the approximation error is $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]$. Moreover, in the special case where the reward and state transition functions are independent of the action distribution of the population, the error improves to $\mathcal{O}(e)$ where $e\triangleq \frac{1}{\sqrt{N}}\sqrt{|\mathcal{X}|}$. Finally, we also devise an algorithm to explicitly construct a local policy. With the help of our approximation results, we further establish that the constructed local policy is within $\mathcal{O}(\max\{e,\epsilon\})$ distance of the optimal policy, and that the sample complexity to achieve such a local policy is $\mathcal{O}(\epsilon^{-3})$, for any $\epsilon>0$.
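As a brief illustration of how the bound scales (the specific values of $N$, $|\mathcal{X}|$, and $|\mathcal{U}|$ below are hypothetical and chosen only for concreteness), consider a network of $N = 10^4$ agents where each agent has $|\mathcal{X}| = 10$ states and $|\mathcal{U}| = 5$ actions:
\begin{equation*}
    e = \frac{1}{\sqrt{N}}\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right] = \frac{\sqrt{10}+\sqrt{5}}{100} \approx 0.054,
\end{equation*}
so the value of the local policy lies within $\mathcal{O}(0.054)$ of the optimal value, and this gap shrinks at the rate $1/\sqrt{N}$ as the population grows.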