Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamic environments to improve system utility, decrease overall cost, and increase mission success probability. Deep Reinforcement Learning (DRL) helps organize agents' behaviors and actions based on their states and represents complex strategies (compositions of actions). This paper proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining that separates an intricate policy into several simple sub-policies and organizes their relationships as Bayesian strategy networks (BSN). We integrate this approach into the state-of-the-art DRL method, soft actor-critic (SAC), and build the corresponding Bayesian soft actor-critic (BSAC) model by organizing several sub-policies as a joint policy. We compare the proposed BSAC method with SAC and other state-of-the-art approaches such as TD3, DDPG, and PPO on the standard continuous control benchmarks (Hopper-v2, Walker2d-v2, and Humanoid-v2) in MuJoCo with the OpenAI Gym environment. The results demonstrate the promising potential of the BSAC method to significantly improve training efficiency. The open-source code for BSAC is available at https://github.com/herolab-uga/bsac.
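As a minimal sketch of the decomposition idea, assuming the standard Bayesian network chain rule (the symbols $\pi_i$, $a_i$, and $\mathrm{pa}(a_i)$ are illustrative notation, not drawn verbatim from the paper), the BSN factorizes the joint policy over an action tuple $a = (a_1, \ldots, a_n)$ into sub-policies conditioned on their parent actions in the network:

\[ \pi(a \mid s) \;=\; \prod_{i=1}^{n} \pi_i\!\left(a_i \mid s, \mathrm{pa}(a_i)\right), \]

where $\mathrm{pa}(a_i)$ denotes the actions of the parent nodes of $a_i$ in the BSN, and each factor $\pi_i$ is one of the simple sub-policies that BSAC trains and combines into the joint policy.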