Reinforcement-learning agents seek to maximize a reward signal through environmental interaction. As humans, we contribute to the learning process by designing the reward function. Like programmers, we have a behavior in mind and must translate it into a formal specification, namely rewards. In this work, we consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space: we prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we propose an environment-independent tiered reward structure and show that it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we empirically evaluate tiered reward functions on several environments and show that they induce desired behavior and lead to fast learning.
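To make the tiered-reward idea concrete, here is a minimal sketch in Python. It assumes a hypothetical partition of a toy chain environment's states into tiers (bad < neutral < goal) with strictly increasing per-tier rewards; the tier assignments and reward values are illustrative placeholders, not the paper's construction, and the Pareto-optimality guarantee rests on conditions on the reward values that this sketch does not reproduce.

```python
from enum import IntEnum


class Tier(IntEnum):
    BAD = 0      # undesirable states: avoid as long as possible
    NEUTRAL = 1  # ordinary states
    GOAL = 2     # desirable states: reach fast and with high probability


# Hypothetical tier assignment for a 1-D chain of 5 states.
STATE_TIERS = [Tier.BAD, Tier.NEUTRAL, Tier.NEUTRAL, Tier.NEUTRAL, Tier.GOAL]

# Strictly increasing per-tier rewards: lower tiers pay less each step, so a
# policy that reaches the goal sooner and lingers in bad states less
# accumulates more discounted reward. Values here are placeholders.
TIER_REWARDS = {Tier.BAD: -10.0, Tier.NEUTRAL: -1.0, Tier.GOAL: 0.0}


def reward(state: int) -> float:
    """Reward depends only on the tier of the current state, which is what
    makes the reward structure independent of the environment's dynamics."""
    return TIER_REWARDS[STATE_TIERS[state]]


if __name__ == "__main__":
    for s in range(len(STATE_TIERS)):
        print(f"state {s}: tier={STATE_TIERS[s].name}, reward={reward(s)}")
```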