This paper investigates to what extent reinforcement learning algorithms can be improved. Our study is split into three parts. First, our analysis shows that the classical asymptotic convergence rate $O(1/\sqrt{N})$ is pessimistic and can be replaced by $O((\log(N)/N)^{\beta})$ with $\frac{1}{2}\leq \beta \leq 1$, where $N$ is the number of iterations. Second, we propose a dynamic optimal policy for choosing the learning-rate sequence $(\gamma_k)_{k\geq 0}$ used in stochastic approximation (SA). We decompose our policy into two interacting levels: the inner level and the outer level. At the inner level, we present the \nameref{Alg:v_4_s} algorithm (for "PAst Sign Search"), which, based on a predefined sequence $(\gamma^o_k)_{k\geq 0}$, constructs a new sequence $(\gamma^i_k)_{k\geq 0}$ whose error decreases faster. At the outer level, we propose an optimal methodology for selecting the predefined sequence $(\gamma^o_k)_{k\geq 0}$. Third, we show empirically that our learning-rate selection methodology significantly outperforms standard algorithms used in reinforcement learning (RL) on the following three applications: the estimation of a drift, the optimal placement of limit orders, and the optimal execution of a large number of shares.
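For context, the learning-rate sequence $(\gamma_k)_{k\geq 0}$ discussed above is the step size of a generic stochastic approximation scheme; the display below is a standard background recursion (with illustrative notation $\theta_k$, $H$, $h$, $X_{k+1}$ not taken from this paper) rather than a result of the present work:
\begin{equation*}
\theta_{k+1} \;=\; \theta_k \;-\; \gamma_k\, H(\theta_k, X_{k+1}),
\qquad
\mathbb{E}\big[H(\theta, X)\big] \;=\; h(\theta),
\end{equation*}
where convergence is classically ensured under the Robbins--Monro conditions $\sum_{k\geq 0}\gamma_k = \infty$ and $\sum_{k\geq 0}\gamma_k^2 < \infty$. The inner and outer levels described above can be read as choosing $(\gamma_k)_{k\geq 0}$ within this family of admissible sequences.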