Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent, while learning to maximize its reward, must also learn to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e., actions that have no effect on the environment when performed in a given state). Knowing this information can reduce the sample complexity of RL algorithms: masking the inapplicable actions out of the policy distribution restricts exploration to actions relevant to finding an optimal policy. This is typically done in an ad-hoc manner, with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introducing this knowledge into the algorithm. We (i) standardize the way such knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn these state-dependent action constraints jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal for masking out irrelevant actions. Moreover, we demonstrate that, thanks to the transferability of the acquired knowledge, it can be reused in other tasks to make the learning process more efficient.
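To make the masking idea concrete, the sketch below shows one common way an applicability mask can be applied to a policy's action logits before normalization, so that inapplicable actions receive zero probability. This is a minimal illustration only; the function name `masked_policy` and the boolean `applicable` interface are assumptions for exposition and are not the paper's actual mechanism or API.

\begin{verbatim}
import numpy as np

def masked_policy(logits, applicable):
    """Illustrative action masking (not the paper's implementation).

    logits     -- raw action preferences from a policy network, shape (A,)
    applicable -- boolean mask, True where the action is applicable
                  in the current state (hypothetical interface)
    """
    # Inapplicable actions get -inf logits and hence zero probability.
    masked = np.where(applicable, logits, -np.inf)
    z = masked - masked.max()        # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Example: a state where actions 1 and 3 are inapplicable.
logits = np.array([0.2, 1.5, -0.3, 0.8])
applicable = np.array([True, False, True, False])
pi = masked_policy(logits, applicable)
# pi assigns probability 0 to actions 1 and 3, so exploration is
# restricted to the applicable actions only.
\end{verbatim}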