In this paper, we propose the Value Iteration Network for Reward Shaping (VIN-RS), a potential-based reward shaping mechanism built on a Convolutional Neural Network (CNN). The proposed VIN-RS embeds a CNN trained on labels computed through the message-passing mechanism of a Hidden Markov Model. The CNN processes images or graphs of the environment to predict the shaping values. Recent work on reward shaping is still limited by the need to train on a representation of the Markov Decision Process (MDP) and to build an explicit estimate of the transition matrix. The advantage of VIN-RS is that it constructs an effective potential function from an estimated MDP while automatically inferring the environment's transition matrix. The proposed VIN-RS estimates this matrix through a self-learned convolution filter while extracting environment details from the input frames or sampled graphs. Motivated by (1) the previous success of message passing for reward shaping and (2) the planning behavior of CNNs, we use these messages to train the CNN of VIN-RS. Experiments are performed on tabular games, Atari 2600, and MuJoCo, covering both discrete and continuous action spaces. Our results show promising improvements in learning speed and maximum cumulative reward compared to the state of the art.
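As background, the potential-based shaping that VIN-RS builds on can be sketched as follows. This is a minimal illustration of the standard form F(s, s') = γΦ(s') − Φ(s); the potential `phi` here is a hypothetical hand-coded stand-in, whereas in VIN-RS it is predicted by the CNN.

```python
GAMMA = 0.99  # discount factor, chosen here only for illustration

def phi(state):
    # Hypothetical potential: negative distance to a goal located at 10.
    # In VIN-RS this value would come from the trained CNN instead.
    return -abs(10 - state)

def shaped_reward(reward, state, next_state, gamma=GAMMA):
    # F(s, s') = gamma * phi(s') - phi(s).
    # Adding F to the reward preserves the optimal policy of the MDP.
    return reward + gamma * phi(next_state) - phi(state)
```

For example, moving from state 5 to state 6 (toward the goal at 10) yields a positive shaping bonus, since Φ increases along the transition.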