We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise -- covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$ -- and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).
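As a concrete illustration of the averaged stochastic approximation scheme discussed above, the sketch below runs TD($\lambda$) with linear (here tabular) features on a single trajectory from a small ergodic Markov chain and tracks the Polyak average of the iterates. The chain, rewards, step size, and $\lambda$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative toy instance: a 3-state ergodic Markov reward process with
# tabular features (d = 3). All constants here are assumptions for the demo.
rng = np.random.default_rng(0)

P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])   # ergodic transition matrix
r = np.array([1.0, 0.0, -1.0])   # per-state rewards
Phi = np.eye(3)                  # feature matrix (tabular case)
gamma = 0.9                      # discount factor
lam = 0.5                        # the lambda in TD(lambda)

def td_lambda_averaged(n=20000, step=0.05):
    """TD(lambda) from one Markov trajectory of length n with a constant
    step size; returns the last iterate and the Polyak average of iterates."""
    theta = np.zeros(3)          # current parameter iterate
    avg = np.zeros(3)            # running average of the iterates
    z = np.zeros(3)              # eligibility trace
    s = 0
    for t in range(1, n + 1):
        s_next = rng.choice(3, p=P[s])
        # Temporal-difference error at the current transition
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        z = gamma * lam * z + Phi[s]        # update eligibility trace
        theta = theta + step * delta * z    # SA update (last iterate)
        avg += (theta - avg) / t            # running Polyak average
        s = s_next
    return theta, avg

theta_last, theta_avg = td_lambda_averaged()
# In the tabular case the target fixed point is the true value function,
# which solves (I - gamma * P) v = r.
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)
print("max error of averaged iterate:", np.max(np.abs(theta_avg - v_true)))
```

In this tabular instance the averaged iterate lands close to the true value function, while the last iterate retains noise of the order of the step size, consistent with the last-iterate versus averaged-iterate distinction in the bounds above.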