Value functions are central to Dynamic Programming and Reinforcement Learning, but their exact estimation suffers from the curse of dimensionality, challenging the development of practical value-function (VF) estimation algorithms. Several approaches have been proposed to overcome this issue, ranging from non-parametric schemes that aggregate states or actions to parametric approximations of state and action VFs via, e.g., linear estimators or deep neural networks. Notably, several high-dimensional state problems can be well approximated by an intrinsic low-rank structure. Motivated by this fact, and leveraging results from low-rank optimization, this paper proposes different stochastic algorithms to estimate a low-rank factorization of the $Q(s, a)$ matrix. This is a non-parametric alternative to VF approximation that dramatically reduces the computational and sample complexities relative to classical $Q$-learning methods, which estimate $Q(s,a)$ separately for each state-action pair.
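The abstract does not spell out the update rule, but the core idea of stochastically fitting a rank-$k$ factorization $Q \approx L R^\top$ can be illustrated with a minimal sketch. The following Python snippet is an assumption-laden illustration (not the paper's algorithm): it applies a TD-error-driven stochastic gradient update to only the rows of the factors $L \in \mathbb{R}^{|S| \times k}$ and $R \in \mathbb{R}^{|A| \times k}$ touched by each observed transition, so only $k(|S| + |A|)$ parameters are maintained instead of $|S||A|$ entries.

```python
import numpy as np

def lowrank_q_update(L, R, s, a, r, s_next, gamma=0.99, lr=0.01):
    """One hypothetical stochastic update of the rank-k factors L (|S| x k)
    and R (|A| x k), so that Q(s, a) is approximated by L[s] @ R[a]."""
    # TD target computed from the current low-rank estimate of Q(s', .)
    q_next = L[s_next] @ R.T          # estimated Q-values for all actions in s'
    target = r + gamma * np.max(q_next)
    # TD error at the visited state-action pair
    td_err = target - L[s] @ R[a]
    # Gradient-style updates of only the rows involved in the transition
    L_s_old = L[s].copy()
    L[s] += lr * td_err * R[a]
    R[a] += lr * td_err * L_s_old
    return td_err

# Usage: synthetic transitions just to exercise the update rule
rng = np.random.default_rng(0)
n_states, n_actions, k = 100, 10, 5
L = 0.1 * rng.standard_normal((n_states, k))
R = 0.1 * rng.standard_normal((n_actions, k))
for _ in range(1000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    s_next = rng.integers(n_states)
    r = rng.standard_normal()
    lowrank_q_update(L, R, s, a, r, s_next)
```

The sample-complexity benefit alluded to in the abstract comes from parameter sharing: a single transition updates an entire row of each factor, so information propagates across all state-action pairs that share those rows, unlike tabular $Q$-learning where each $(s,a)$ entry is learned in isolation.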