Human environments are often regulated by explicit and complex rulesets. Integrating Reinforcement Learning (RL) agents into such environments motivates the development of learning mechanisms that perform well in rule-dense and exception-ridden settings such as autonomous driving on regulated roads. In this paper, we propose a method for organising experience by partitioning the experience buffer into clusters labelled on a per-explanation basis. We present discrete and continuous navigation environments compatible with modular rulesets, along with nine learning tasks. For environments with explainable rulesets, we convert rule-based explanations into case-based explanations by allocating state-transitions to clusters labelled with explanations. This allows us to sample experiences in a curricular and task-oriented manner, focusing on the rarity, importance, and meaning of events. We label this concept Explanation-Awareness (XA). We perform XA experience replay (XAER) with intra- and inter-cluster prioritisation, and introduce XA-compatible versions of DQN, TD3, and SAC. Performance is consistently superior with the XA versions of these algorithms compared to traditional Prioritised Experience Replay baselines, indicating that explanation engineering can be used in lieu of reward engineering for environments with explainable features.
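The clustering and two-level sampling described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the class name, the capacity handling, and the inverse-size inter-cluster weighting (which up-weights rare explanations) are assumptions, and the uniform intra-cluster draw is a placeholder where a priority such as TD-error could be used.

```python
import random
from collections import defaultdict

class XAReplayBuffer:
    """Hypothetical sketch of Explanation-Aware experience replay (XAER):
    transitions are grouped into clusters keyed by their explanation label;
    sampling first picks a cluster (inter-cluster prioritisation), then a
    transition within it (intra-cluster prioritisation)."""

    def __init__(self, capacity_per_cluster=10_000):
        self.capacity = capacity_per_cluster
        self.clusters = defaultdict(list)  # explanation label -> transitions

    def add(self, transition, explanation):
        """Store a (state, action, reward, next_state) tuple under the
        explanation produced by the ruleset for that transition."""
        buf = self.clusters[explanation]
        if len(buf) >= self.capacity:
            buf.pop(0)  # drop the oldest transition within the cluster
        buf.append(transition)

    def sample(self, batch_size):
        labels = list(self.clusters)
        # Inter-cluster prioritisation (assumed rule): weight each cluster by
        # the inverse of its size, so rare but meaningful events are replayed
        # more often than frequent, mundane ones.
        weights = [1.0 / len(self.clusters[label]) for label in labels]
        batch = []
        for _ in range(batch_size):
            label = random.choices(labels, weights=weights)[0]
            # Intra-cluster prioritisation: uniform here as a placeholder.
            batch.append(random.choice(self.clusters[label]))
        return batch
```

A buffer like this keeps the interface of a standard replay buffer, so it can back DQN, TD3, or SAC without changing the learning update itself; only the sampling distribution over stored transitions changes.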