One of the fundamental challenges in reinforcement learning (RL) is data efficiency: modern algorithms require a very large number of training samples, especially compared to humans, to solve environments with high-dimensional observations. The problem is exacerbated when the reward signal is sparse. In this work, we propose learning a state representation in a self-supervised manner for reward prediction. The reward predictor learns to estimate either a raw or a smoothed version of the true reward signal in environments with a single, terminating goal state. We augment the training of out-of-the-box RL agents by shaping the reward with our reward predictor during policy learning. Using our representation to preprocess high-dimensional observations, together with using the predictor for reward shaping, significantly improves Actor Critic using Kronecker-factored Trust Region (ACKTR) and Proximal Policy Optimization (PPO) in single-goal environments with visual inputs.
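The following is a minimal sketch, not the authors' implementation, of the mechanism the abstract describes: a convolutional encoder trained self-supervised to regress the (raw or smoothed) reward from visual observations, whose output is then added to the sparse environment reward during policy learning. The network sizes, the 84x84 single-channel input, and the coefficient `shaping_coef` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardPredictor(nn.Module):
    """Encodes a visual observation and regresses a scalar reward estimate."""

    def __init__(self, in_channels: int = 1, embed_dim: int = 256):
        super().__init__()
        # Encoder doubles as the learned state representation used for preprocessing.
        # Architecture and sizes are assumptions for an 84x84 input.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, embed_dim), nn.ReLU(),
        )
        self.head = nn.Linear(embed_dim, 1)  # scalar reward estimate

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs)).squeeze(-1)


def predictor_loss(model: RewardPredictor,
                   obs: torch.Tensor,
                   reward_target: torch.Tensor) -> torch.Tensor:
    """Self-supervised regression loss; reward_target is the raw or smoothed true reward."""
    return F.mse_loss(model(obs), reward_target)


@torch.no_grad()
def shaped_reward(model: RewardPredictor,
                  next_obs: torch.Tensor,
                  env_reward: float,
                  shaping_coef: float = 0.1) -> float:
    """Augment the sparse environment reward with the predictor's estimate (shaping_coef is assumed)."""
    return env_reward + shaping_coef * model(next_obs.unsqueeze(0)).item()
```

In this sketch the shaped reward would simply replace the environment reward fed to an off-the-shelf ACKTR or PPO agent, while the encoder's features can be reused as the agent's observation preprocessing.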