How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient, $\lambda$ (treated as a hyperparameter), raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, and the time interval between the two. Of course, it is not clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters, we develop a metagradient procedure for learning them during the usual RL training of a policy. Our empirical work shows that these pairwise weight functions can often be learned alongside the policy to achieve better performance than competing approaches.
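To make the contrast concrete, the following is a minimal sketch, not notation fixed by this abstract: the return symbols, the discount $\gamma$, and the weight function $w$ are introduced here purely for illustration, and bootstrapping terms are omitted for brevity. The classical heuristic weights the reward received $k$ steps after a state-action by the geometric factor $(\gamma\lambda)^{k}$, whereas the pairwise generalization replaces the scalar $\lambda^{k}$ with a function of the two states and the interval:
$$
G_t \;=\; \sum_{k \ge 0} (\gamma\lambda)^{k}\, R_{t+k+1}
\qquad\longrightarrow\qquad
G_t^{w} \;=\; \sum_{k \ge 0} \gamma^{k}\, w\!\left(S_t,\, S_{t+k},\, k\right) R_{t+k+1},
$$
where, under this illustrative formulation, $w$ would be adjusted by the metagradient procedure while the policy is trained on the resulting returns.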