Deep reinforcement learning has enabled human-level or even super-human performance in various types of games. However, the amount of exploration required for learning is often quite large; in this sense, deep reinforcement learning is also "super-human" in that no human could perform such an amount of exploration. To address this problem, we focus on the \textit{satisficing} policy, which is a qualitatively different approach from those of existing optimization algorithms. We propose Linear RS (LinRS), a satisficing algorithm that linearly extends risk-sensitive satisficing (RS), so that satisficing can be applied to a wider range of tasks. This generalization of RS yields an algorithm that reduces the volume of exploratory actions by taking an approach different from that of existing optimization algorithms. LinRS uses linear regression and multiclass classification to linearly approximate both the action value and the proportion of action selections required in the RS calculation. Our experiments on contextual bandit problems indicate that LinRS reduced both the number of explorations and the run time compared with existing algorithms. These results suggest that further generalization of satisficing algorithms may be useful for complex environments, including those handled with deep reinforcement learning.