Autonomous cyber and cyber-physical systems need to perform decision-making, learning, and control in unknown environments. Such decision-making can be sensitive to multiple factors, including modeling errors, changes in costs, and impacts of events in the tails of probability distributions. Although multi-agent reinforcement learning (MARL) provides a framework for learning behaviors through repeated interactions with the environment by minimizing an average cost, minimizing the average alone is not adequate to overcome these challenges. In this paper, we develop a distributed MARL approach that solves decision-making problems in unknown environments by learning risk-aware actions. We use the conditional value-at-risk (CVaR) to characterize the cost function being minimized, and define a Bellman operator to characterize the value function associated with a given state-action pair. We prove that this operator satisfies a contraction property, and that its repeated application converges to the optimal value function. We then propose a distributed MARL algorithm, called the CVaR QD-Learning algorithm, and establish that the value functions of the individual agents reach consensus. We identify several challenges that arise in implementing the CVaR QD-Learning algorithm, and present solutions to overcome them. We evaluate the CVaR QD-Learning algorithm through simulations, and demonstrate the effect of the risk parameter on the value functions at consensus.
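For reference, a standard definition of the conditional value-at-risk of a cost random variable $X$ at level $\alpha \in (0,1]$ is the Rockafellar-Uryasev form below; the paper may adopt an equivalent convention, so this is stated only as context:

\[
\mathrm{CVaR}_{\alpha}(X) \;=\; \inf_{\nu \in \mathbb{R}} \left\{ \nu + \frac{1}{\alpha}\, \mathbb{E}\!\left[(X - \nu)^{+}\right] \right\},
\qquad (X-\nu)^{+} = \max\{X-\nu,\, 0\}.
\]

For $\alpha = 1$ this reduces to the expected cost, while small $\alpha$ weights the worst $\alpha$-fraction of outcomes, which is what makes the resulting policies risk-aware.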
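To make the algorithmic structure concrete, the following is a minimal sketch of one tabular update in the spirit of QD-Learning, where each agent combines a consensus pull toward its neighbors' Q-tables with a local innovation term whose one-step cost is measured by an empirical CVaR rather than a mean. The function names, step sizes, and the empirical CVaR estimator are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def cvar_of_samples(samples, alpha):
    """Empirical CVaR at level alpha of a batch of cost samples:
    the mean of the worst alpha-fraction of costs (illustrative estimator)."""
    samples = np.asarray(samples)
    var = np.quantile(samples, 1.0 - alpha)   # empirical value-at-risk threshold
    tail = samples[samples >= var]            # worst alpha-fraction of the batch
    return tail.mean() if tail.size else var

def qd_learning_step(Q, neighbors_Q, s, a, cost_samples, s_next,
                     alpha=0.1, gamma=0.95, lr=0.1, consensus_weight=0.05):
    """One hypothetical tabular update for agent i: a consensus pull toward
    neighbor Q-tables plus a local temporal-difference 'innovation' term
    whose one-step cost is an empirical CVaR instead of a sample mean."""
    # Consensus term: move toward neighbors' estimates (disagreement penalty).
    consensus = sum(Qj[s, a] - Q[s, a] for Qj in neighbors_Q)
    # Risk-aware TD target: CVaR of sampled one-step costs plus the discounted
    # minimum (costs are minimized) over next actions.
    target = cvar_of_samples(cost_samples, alpha) + gamma * Q[s_next].min()
    innovation = target - Q[s, a]
    Q[s, a] += consensus_weight * consensus + lr * innovation
    return Q
```

In QD-Learning the consensus and innovation weights typically decay over time at different rates to guarantee convergence; fixed step sizes are used here only to keep the sketch short.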