Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every state implies that the agent must learn, while also maximizing its reward, to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e. actions that have no effect on the environment when performed in a given state). Knowing this information can reduce the sample complexity of RL algorithms by masking inapplicable actions out of the policy distribution, so that only actions relevant to finding an optimal policy are explored. While this knowledge has long been formalized in the Automated Planning community through the concept of preconditions in the STRIPS language, RL algorithms have never formally exploited it to prune the search space; instead, it is typically injected in an ad-hoc manner through hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introducing this knowledge into the algorithm. We (i) standardize the way such knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn a partial action model encapsulating the preconditions of each action jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal for masking out irrelevant actions. Moreover, we demonstrate that, thanks to the transferability of the acquired knowledge, it can be reused in other tasks and domains to make learning more efficient.
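To make the masking idea concrete, the following is a minimal sketch (not the paper's implementation) of how inapplicable actions are commonly removed from a categorical policy distribution in PyTorch; the function name `masked_policy_distribution` and the boolean applicability mask are illustrative assumptions.

```python
import torch

def masked_policy_distribution(logits: torch.Tensor,
                               applicable: torch.Tensor) -> torch.distributions.Categorical:
    """Build a categorical policy restricted to applicable actions.

    logits:      (batch, num_actions) raw policy outputs
    applicable:  (batch, num_actions) boolean mask, True where the action
                 is applicable in the current state (assumed given here;
                 in the paper it would come from preconditions or a
                 learned partial action model)
    """
    # Give inapplicable actions a -inf logit so they receive zero
    # probability after the softmax and can never be sampled.
    masked_logits = logits.masked_fill(~applicable, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# Example: 4 actions, actions 1 and 3 are inapplicable in this state.
logits = torch.randn(1, 4)
mask = torch.tensor([[True, False, True, False]])
dist = masked_policy_distribution(logits, mask)
action = dist.sample()             # only actions 0 or 2 can be drawn
log_prob = dist.log_prob(action)   # used as usual in the policy-gradient loss
```

Because sampling and log-probabilities are computed from the masked distribution, exploration and the policy-gradient update are both confined to applicable actions, which is what yields the sample-efficiency gain described above.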