Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous space, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
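As an illustration of why the decomposition preserves optimality (a minimal sketch in notation not drawn from the paper body: the state is written $s = (e, x)$ with endogenous part $e$ and exogenous part $x$; $R_{\mathrm{end}}$ and $R_{\mathrm{exo}}$ are hypothetical names for the two reward components; and the exogenous dynamics $P(x' \mid x)$ are assumed independent of the action and of $e$):

\begin{align}
  R(s_t, a_t) &= R_{\mathrm{end}}(e_t, a_t) + R_{\mathrm{exo}}(x_t), \\
  J(\pi) &= \mathbb{E}_{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t)\Big]
          = \underbrace{\mathbb{E}_{\pi}\!\Big[\sum_{t \ge 0} \gamma^{t} R_{\mathrm{end}}(e_t, a_t)\Big]}_{\text{endogenous MDP objective}}
          + \underbrace{\mathbb{E}\!\Big[\sum_{t \ge 0} \gamma^{t} R_{\mathrm{exo}}(x_t)\Big]}_{\text{exogenous MRP, independent of } \pi}.
\end{align}

Because the second term does not depend on $\pi$, any maximizer of the endogenous objective also maximizes $J(\pi)$, while subtracting $R_{\mathrm{exo}}$ removes its contribution to the variance of the return estimates used during learning.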