Multi-Agent Congestion Cost Minimization with Linear Function Approximation

This work considers multiple agents traversing a network from a source node to a goal node. An agent's cost of traversing a link has both a private component and a congestion component. Each agent's objective is to find, in a decentralized way, a path to the goal node with minimum overall cost. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of the transition probabilities and the global cost function. In the absence of a central controller, and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its own estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves sub-linear regret. The proof requires establishing the convergence of the cost function parameters and of the MAEVI algorithm, and analyzing the regret bounds induced by the MAEVI triggering condition for each agent. To validate the algorithm, we implement it on a two-node network with multiple links. We first identify the optimal policy, that is, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work generalizes the problem of learning the stochastic shortest path.
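The parameter-sharing step described above can be illustrated with a generic consensus-averaging sketch. This is not the MACCM update itself, which the abstract does not specify. It assumes an undirected communication graph that changes each round, Metropolis weights (a standard doubly stochastic choice, swapped in here for illustration), and hypothetical names `metropolis_weights` and `consensus_round`. It only shows how agents that exchange parameter vectors with current neighbors can come to agree on the network-wide average without a central controller.

```python
import numpy as np


def metropolis_weights(adjacency):
    """Doubly stochastic mixing weights for an undirected graph (assumed rule)."""
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                weights[i, j] = 1.0 / (1.0 + max(degrees[i], degrees[j]))
        # The self-weight absorbs the remainder so each row sums to one.
        weights[i, i] = 1.0 - weights[i].sum()
    return weights


def consensus_round(params, adjacency):
    """Each agent replaces its parameters by a weighted average over neighbors."""
    return metropolis_weights(adjacency) @ params


# Hypothetical time-varying network of three agents: the only active link
# alternates between (0, 1) and (1, 2), so no single round is connected,
# but the union of the two graphs is.
graphs = [
    np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]]),
    np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]]),
]

rng = np.random.default_rng(0)
params = rng.normal(size=(3, 4))  # one 4-dimensional parameter vector per agent
target = params.mean(axis=0)      # network-wide average, unknown to any agent

for t in range(50):
    params = consensus_round(params, graphs[t % 2])

# Largest disagreement between any agent and the network-wide average.
max_gap = np.abs(params - target).max()
```

Because the weights are doubly stochastic, the average of the agents' vectors is preserved in every round, and the alternating links are enough for all agents to approach it. In the setting of the abstract, each agent would additionally fold in its own private cost observations between communication rounds.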
