Multi-agent reinforcement learning has received considerable attention in recent years and has applications in many different areas. Existing methods based on centralized training and decentralized execution attempt to train agents toward a coordinated pattern of actions that yields an optimal joint policy. However, if some agents are stochastic to varying degrees, these methods often fail to converge and produce poor coordination among agents. In this paper we show how such stochasticity of agents, which may result from malfunction or aging of robots, adds to the uncertainty in coordination and thereby contributes to unsatisfactory global coordination. In this setting, the deterministic agents must understand the behavior and limitations of the stochastic agents while arriving at an optimal joint policy. Our solution, DSDF, tunes the discount factor of each agent according to its uncertainty and uses the resulting values to update the utility networks of the individual agents. DSDF also imparts a degree of reliability to the coordination by assigning stochastic agents tasks that are immediate and have shorter trajectories, while deterministic agents take on the tasks that require longer planning. Such a method enables joint coordination among agents, some of which may be only partially functional, and can thereby reduce or delay the investment in agent/robot replacement in many circumstances. Results on benchmark environments for different scenarios show the efficacy of the proposed approach compared with existing approaches.
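The core idea of scaling each agent's discount factor by its estimated uncertainty can be illustrated as follows. This is a minimal, hypothetical sketch, not the paper's exact formulation: the mapping `per_agent_gamma`, its bounds `gamma_min`/`gamma_max`, and the linear scaling are illustrative assumptions; the paper's actual tuning rule may differ.

```python
def per_agent_gamma(uncertainty, gamma_max=0.99, gamma_min=0.5):
    """Map an uncertainty estimate in [0, 1] to a discount factor.

    A highly stochastic agent gets a smaller gamma, so its utility
    network favors short-horizon (immediate) tasks; a near-deterministic
    agent keeps a large gamma and can plan over longer trajectories.
    """
    u = min(max(float(uncertainty), 0.0), 1.0)  # clamp to [0, 1]
    return gamma_min + (gamma_max - gamma_min) * (1.0 - u)


def td_target(reward, next_q, uncertainty, done=False):
    """One-step TD target using the agent-specific discount factor."""
    gamma = per_agent_gamma(uncertainty)
    return reward + (0.0 if done else gamma * next_q)


# A reliable agent (u = 0.0) keeps the long-horizon discount factor,
# while a degraded agent (u = 0.9) is pushed toward myopic behavior.
print(per_agent_gamma(0.0))  # 0.99
print(per_agent_gamma(0.9))  # 0.549
```

In this sketch, each agent's utility network would be trained against its own `td_target`, so agents with unreliable actuation effectively discount distant rewards more heavily.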