The endeavor of artificial intelligence (AI) is to design autonomous agents capable of achieving complex tasks. In particular, reinforcement learning (RL) provides a theoretical framework for learning optimal behaviors. In practice, RL algorithms rely on geometric discounting to evaluate this optimality. Unfortunately, this does not cover decision processes where future returns are not exponentially less valuable. Depending on the problem, this limitation induces sample-inefficiency (as feedback is exponentially decayed) and requires additional curricula/exploration mechanisms (to deal with sparse, deceptive, or adversarial rewards). In this paper, we tackle these issues by generalizing the discounted problem formulation with a family of delayed objective functions. We investigate the underlying RL problem to derive: 1) the optimal stationary solution and 2) an approximation of the optimal non-stationary control. The devised algorithms solve hard exploration problems on tabular environments and improve sample-efficiency on classic simulated robotics benchmarks.
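As a minimal illustration of the contrast the abstract draws (the exact family of delayed objectives is not specified here; the weighting $w(t)$ below is an assumed placeholder), the standard geometrically discounted return and a generalized delayed objective can be written as

$$J_\gamma(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right], \qquad J_w(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} w(t)\, r_t\right],$$

where $0 < \gamma < 1$ is the discount factor and $w(t)$ is a non-geometric weighting (for instance, a hyperbolic schedule such as $w(t) = 1/(1 + k t)$) that decays future returns less aggressively than $\gamma^t$.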