The performance of a reinforcement learning (RL) system depends on the computational architecture used to approximate a value function. Deep learning methods provide both optimization techniques and architectures for approximating nonlinear functions from noisy, high-dimensional observations. However, prevailing optimization techniques are not designed for strictly incremental online updates. Nor are standard architectures designed for observations with an a priori unknown structure: for example, light sensors randomly dispersed in space. This paper proposes an online RL prediction algorithm with an adaptive architecture that efficiently finds useful nonlinear features. The algorithm is evaluated in a spatial domain with high-dimensional, stochastic observations. The algorithm outperforms non-adaptive baseline architectures and approaches the performance of an architecture given side-channel information. These results are a step towards scalable RL algorithms for more general problems, where the observation structure is not available.