Designing optimal reward functions is desirable but extremely difficult in reinforcement learning (RL). For modern, complex tasks, sophisticated reward functions are widely used to simplify policy learning, yet even a tiny adjustment to them is expensive to evaluate because of the rapidly growing cost of training. To this end, we propose a hindsight reward tweaking approach, a novel paradigm for deep reinforcement learning that models the influence of reward functions within a near-optimal space. We simply extend the input observation with a condition vector linearly correlated with the effective environment reward parameters and train the model in the conventional manner, except that reward configurations are randomized, obtaining a hyper-policy whose characteristics are sensitively regulated over the condition space. We demonstrate the feasibility of this approach and study one of its potential applications, policy performance boosting, on multiple MuJoCo tasks.
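To make the training scheme concrete, the following is a minimal sketch (not the authors' released code) of the idea described above: the observation is extended with a condition vector tied to the effective reward parameters, and each episode samples a new reward configuration so the learned policy becomes conditioned on the reward weights. The environment interface (reset/step), the vector of per-term rewards, and the weight range are illustrative assumptions.

```python
import numpy as np


class RewardConditionedEnv:
    """Wraps an env, randomizes reward weights per episode, and appends
    a linearly scaled copy of the weights to the observation."""

    def __init__(self, env, n_weights, low=0.5, high=1.5, seed=0):
        self.env = env
        self.n_weights = n_weights
        self.low, self.high = low, high
        self.rng = np.random.default_rng(seed)
        self.weights = np.ones(n_weights)

    def _augment(self, obs):
        # Condition vector: a linear map of the effective reward parameters.
        cond = (self.weights - self.low) / (self.high - self.low)
        return np.concatenate([np.asarray(obs, dtype=np.float32),
                               cond.astype(np.float32)])

    def reset(self):
        # Sample a new reward configuration for this episode.
        self.weights = self.rng.uniform(self.low, self.high, self.n_weights)
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward_terms, done, info = self.env.step(action)
        # reward_terms is assumed to be a vector of per-term rewards
        # (e.g. forward progress, control cost); the scalar reward is
        # their weighted sum under the current configuration.
        reward = float(np.dot(self.weights, reward_terms))
        return self._augment(obs), reward, done, info
```

A standard RL algorithm trained on this wrapped environment would then yield a single hyper-policy whose behavior can be steered after training by fixing the condition vector, which is the "hindsight reward tweaking" use case evaluated on the MuJoCo tasks.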