Classical game-theoretic approaches to multi-agent systems, in both the forward policy design problem and the inverse reward learning problem, often make strong rationality assumptions: agents perfectly maximize expected utilities under uncertainty. Such assumptions, however, substantially mismatch observed human behaviors, such as satisficing with sub-optimal choices, risk seeking, and loss aversion. In this paper, we investigate the bounded risk-sensitive Markov Game (BRSMG) and its inverse reward learning problem, with the goals of modeling realistic human behaviors and learning human behavioral models. Drawing on iterative reasoning models and cumulative prospect theory, we assume that humans have bounded intelligence and maximize risk-sensitive utilities in BRSMGs. Convergence analyses for both the forward policy design and the inverse reward learning problems are established under the BRSMG framework. We validate the proposed forward policy design and inverse reward learning algorithms in a navigation scenario. The results show that agents' behaviors exhibit both risk-averse and risk-seeking characteristics. Moreover, in the inverse reward learning task, the proposed bounded risk-sensitive inverse learning algorithm outperforms a baseline risk-neutral inverse learning algorithm: given demonstrations of agents' interactive behaviors, it recovers not only more accurate reward values but also the agents' intelligence levels and risk-measure parameters.
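To make the cumulative prospect theory ingredients of the risk-sensitive utility concrete, the following is a minimal sketch of the standard Tversky–Kahneman value and probability-weighting functions. The parameter values (alpha, beta, lam, gamma) are Tversky and Kahneman's published empirical estimates, not parameters from this paper; the BRSMG formulation may parameterize these functions differently.

```python
def cpt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains,
    convex and steeper for losses (loss aversion via lam > 1)."""
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** beta

def cpt_weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweights small
    probabilities and underweights moderate-to-large ones,
    which produces risk-seeking and risk-averse behavior
    depending on the outcome and probability regime."""
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)
```

For example, a small-probability gain is overweighted (`cpt_weight(0.01) > 0.01`), while a loss of 1 is felt more strongly than a gain of 1 (`cpt_value(-1.0) = -2.25` versus `cpt_value(1.0) = 1.0`), capturing the loss-aversion and risk-sensitivity effects the abstract refers to.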