This paper proposes a new reinforcement learning method with hyperbolic discounting. By combining a new temporal difference error, which incorporates hyperbolic discounting in a recursive manner, with the reward-punishment framework, a new scheme for learning the optimal policy is derived. In simulations, the proposed method outperforms standard reinforcement learning, although its performance depends on the design of the reward and punishment. In addition, the average discount factors with respect to reward and punishment differ from each other, resembling the sign effect observed in animal behavior.
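To make the notion of hyperbolic discounting in a recursive manner concrete, the following minimal sketch contrasts the standard exponential discount with a hyperbolic one and shows that the hyperbolic curve can be generated step by step. It is not the temporal difference error derived in the paper: the function names, the parameter `k`, and the specific form 1 / (1 + k t) are assumptions chosen for illustration only.

```python
from typing import List


def exponential_discount(gamma: float, t: int) -> float:
    """Standard RL discount: weight gamma**t for a reward t steps ahead."""
    return gamma ** t


def hyperbolic_discount(k: float, t: int) -> float:
    """Assumed hyperbolic form: weight 1 / (1 + k*t) for a reward t steps ahead."""
    return 1.0 / (1.0 + k * t)


def hyperbolic_discount_recursive(k: float, steps: int) -> List[float]:
    """Generate the same hyperbolic weights one step at a time.

    Uses the identity d_{t+1} = d_t / (1 + k * d_t) with d_0 = 1,
    which follows algebraically from d_t = 1 / (1 + k*t).
    """
    d = 1.0
    weights = [d]
    for _ in range(steps):
        d = d / (1.0 + k * d)
        weights.append(d)
    return weights


def effective_step_factors(weights: List[float]) -> List[float]:
    """Per-step discount factors d_{t+1} / d_t implied by a weight sequence."""
    return [b / a for a, b in zip(weights, weights[1:])]


if __name__ == "__main__":
    k, gamma, horizon = 0.5, 0.9, 10  # illustrative values, not from the paper
    hyper = hyperbolic_discount_recursive(k, horizon)
    for t in range(horizon + 1):
        print(t, round(exponential_discount(gamma, t), 4), round(hyper[t], 4))
```

Under these assumptions, the per-step factor implied by the hyperbolic weights is not constant: it rises toward 1 as the horizon grows, whereas the exponential case keeps a fixed `gamma` at every step.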