Maintaining the long-term exploration capability of the agent remains one of the critical challenges in deep reinforcement learning. A representative solution is to leverage reward shaping to provide intrinsic rewards that encourage exploration. However, most existing methods suffer from vanishing intrinsic rewards and thus cannot provide sustainable exploration incentives. Moreover, they rely heavily on complex models and additional memory to record the learning procedure, resulting in high computational complexity and low robustness. To tackle this problem, entropy-based methods have been proposed to evaluate global exploration performance, encouraging the agent to visit the state space more equitably. However, the sample complexity of estimating the state visitation entropy is prohibitive in environments with high-dimensional observations. In this paper, we introduce a novel metric, Jain's fairness index (JFI), to replace the entropy regularizer, which addresses the exploration problem from a new perspective. In sharp contrast to the entropy regularizer, JFI is easier to compute, more robust, and readily generalized to arbitrary tasks. Furthermore, we leverage a variational auto-encoder (VAE) model to capture the life-long novelty of states, which is combined with the global JFI score to form multimodal intrinsic rewards. Finally, extensive simulation results demonstrate that our multimodal reward shaping (MMRS) method achieves higher performance than other benchmark schemes.
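The JFI referred to above is the standard Jain's fairness index, \(J(x) = (\sum_i x_i)^2 / (n \sum_i x_i^2)\), which ranges over \((0, 1]\) and equals 1 when all components are equal. A minimal sketch of computing it over state-visitation counts (the function name and example counts are illustrative, not from the paper):

```python
import numpy as np

def jains_fairness_index(x):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    x = np.asarray(x, dtype=float)
    return x.sum() ** 2 / (len(x) * np.square(x).sum())

# Perfectly uniform state visitation -> maximal fairness
uniform = jains_fairness_index([5, 5, 5, 5])   # 1.0
# All visits concentrated on one state -> minimal fairness (1/n)
skewed = jains_fairness_index([20, 0, 0, 0])   # 0.25
```

A higher JFI over the visitation distribution indicates more equitable coverage of the state space, playing a role analogous to a high state-visitation entropy but without requiring density estimation.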