In this paper, we investigate the problem of \textit{episodic reinforcement learning} with quantum oracles for state evolution. To this end, we propose an \textit{Upper Confidence Bound} (UCB) based quantum algorithmic framework for learning a finite-horizon MDP. Our quantum algorithm achieves an exponential improvement in regret over its classical counterparts: a regret of $\Tilde{\mathcal{O}}(1)$ as compared to $\Tilde{\mathcal{O}}(\sqrt{K})$,\footnote{$\Tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors.} where $K$ is the number of training episodes. To achieve this advantage, we exploit an efficient quantum mean estimation technique that provides a quadratic improvement, relative to classical mean estimation, in the number of i.i.d. samples needed to estimate the mean of sub-Gaussian random variables. This improvement is key to the significant regret reduction in quantum reinforcement learning. We provide proof-of-concept experiments on various RL environments that demonstrate the performance gains of the proposed algorithmic framework.
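As a heuristic illustration of why a quadratic gain in mean estimation can turn $\sqrt{K}$ regret into logarithmic regret, consider the following sketch. It is not the paper's proof: it assumes that the per-episode regret is controlled by an estimation error after $n$ samples, writes $\sigma$ for an assumed sub-Gaussian scale parameter, $\mu$ for the true mean, and $\hat{\mu}_n$ for its estimate (all three symbols are introduced here for illustration), and drops constants and logarithmic factors.
\begin{align*}
\text{classical:}\quad & |\hat{\mu}_n - \mu| \lesssim \frac{\sigma}{\sqrt{n}}
  \;\Longrightarrow\; \sum_{n=1}^{K} \frac{\sigma}{\sqrt{n}} \le 2\sigma\sqrt{K}, \\
\text{quantum:}\quad & |\hat{\mu}_n - \mu| \lesssim \frac{\sigma}{n}
  \;\Longrightarrow\; \sum_{n=1}^{K} \frac{\sigma}{n} \le \sigma\,(1 + \ln K).
\end{align*}
Under these assumptions, the summed classical error grows as $\sqrt{K}$, while the summed quantum error grows only logarithmically in $K$, which is consistent with the $\Tilde{\mathcal{O}}(\sqrt{K})$ versus $\Tilde{\mathcal{O}}(1)$ comparison stated above.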