Many empirical studies of goal-reaching reinforcement learning (RL) tasks have verified that rewarding the agent for completing subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework that quantifies the computational benefit of rewarding the completion of subgoals, measured in the number of synchronous value iterations. In particular, we model subgoals as one-way {\em intermediate states}, which can be visited at most once per episode, and propose two settings built on these one-way intermediate states: the one-way single-path (OWSP) setting and the one-way multi-path (OWMP) setting. In both settings, we demonstrate that adding {\em intermediate rewards} at subgoals is more computationally efficient than rewarding the agent only once it reaches a terminal goal state. We also reveal a trade-off between computational complexity and the pursuit of the shortest path in the OWMP setting: adding intermediate rewards significantly reduces the computational complexity of reaching the goal, but the agent may not find the shortest path, whereas with sparse terminal rewards the agent finds the shortest path at a significantly higher computational cost. Finally, we corroborate our theoretical results with extensive experiments on MiniGrid environments using Q-learning and several popular deep RL algorithms.
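As an illustrative sketch (not taken from the paper), the snippet below compares how many synchronous value-iteration sweeps are needed before reward information propagates back to the start state of a toy one-way chain, once with only a sparse terminal reward and once with an additional intermediate reward at a subgoal. The chain length, discount factor, subgoal position, and reward magnitudes are all assumptions made for this example.

```python
import numpy as np

N = 20               # chain length (assumed): states 0..N, state N is the terminal goal
GAMMA = 0.9          # discount factor (assumed)
SUBGOAL = N // 2     # index of the one-way intermediate state / subgoal (assumed)

def sweeps_until_start_informed(use_intermediate_reward):
    """Run synchronous value iteration on the one-way chain and return the
    number of sweeps until V[0] > 0, i.e. until reward information has
    propagated back to the start state."""
    V = np.zeros(N + 1)
    for sweep in range(1, 10_000):
        new_V = np.zeros_like(V)
        for s in range(N):                      # terminal state N is absorbing
            r = 1.0 if s + 1 == N else 0.0      # sparse reward at the terminal state
            if use_intermediate_reward and s + 1 == SUBGOAL:
                r += 0.5                        # extra reward for reaching the subgoal
            # Only the "forward" action exists on this one-way chain.
            new_V[s] = r + GAMMA * V[s + 1]
        V = new_V
        if V[0] > 0:
            return sweep
    return None

print("terminal reward only :", sweeps_until_start_informed(False), "sweeps")
print("with subgoal reward  :", sweeps_until_start_informed(True), "sweeps")
```

Under these assumptions the terminal-only reward needs a number of sweeps equal to the full chain length before the start state's value becomes informative, while the subgoal reward halves that count, which is the kind of computational saving the abstract refers to.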