How much credit (or blame) should an action taken in a state receive for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient $\lambda$ (treated as a hyperparameter) raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, and the time interval between the two. Of course, it is not clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters, we develop a metagradient procedure for learning them during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions while learning the policy, and thereby achieve better performance than competing approaches.
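As an illustrative sketch only (not the paper's exact formulation), the contrast between the two weightings can be written in terms of the standard $\lambda$-return decomposition over temporal-difference errors; here $V$, $\delta$, and the pairwise weight function $w$ are notational assumptions introduced for this sketch, and the precise parameterization used in the paper may differ:
$$
G_t^{\lambda} = V(s_t) + \sum_{k=0}^{\infty} (\gamma\lambda)^{k}\,\delta_{t+k},
\qquad
G_t^{w} = V(s_t) + \sum_{k=0}^{\infty} \gamma^{k}\, w(s_t, s_{t+k}, k)\,\delta_{t+k},
$$
where $\delta_{t+k} = r_{t+k+1} + \gamma V(s_{t+k+1}) - V(s_{t+k})$. The first form assigns credit with the scalar $\lambda$ raised to the power of the time gap $k$; the second replaces $\lambda^{k}$ with a weight that depends on the state at the time of the action, the state at the time of the reward, and the gap between them, which is the kind of pairwise weighting learned by metagradients during policy training.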