While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding remains limited. In particular, it is still unclear how to design provably efficient quantum RL algorithms that address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which can handle problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with a $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of the classical UCRL/UCRL-VTR algorithms, augmented with a novel combination of lazy updating mechanisms and quantum estimation subroutines; this combination is the key to breaking the $\Omega(\sqrt{T})$ regret barrier in classical RL. To the best of our knowledge, this is the first work to study online exploration in quantum RL with provable logarithmic worst-case regret.
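To give rough intuition for why quantum estimation can break the $\Omega(\sqrt{T})$ barrier (a heuristic sketch, not the paper's actual analysis; constants and the dependence on $S$, $A$, $H$, and $d$ are suppressed): classical mean estimation from $n$ samples incurs error $\mathcal{O}(1/\sqrt{n})$ by Hoeffding-type bounds, whereas quantum mean estimation subroutines achieve error $\mathcal{O}(1/n)$ from $n$ quantum samples. Accumulating such per-episode estimation errors over $T$ episodes gives
\[
  \underbrace{\sum_{t=1}^{T} \frac{c}{\sqrt{t}} \;=\; \mathcal{O}\bigl(\sqrt{T}\bigr)}_{\text{classical rate}}
  \qquad\text{vs.}\qquad
  \underbrace{\sum_{t=1}^{T} \frac{c}{t} \;=\; \mathcal{O}\bigl(\log T\bigr)}_{\text{quantum rate}},
\]
which suggests how the quadratically faster estimation, combined with lazy updates that limit the number of policy switches to $\mathcal{O}(\log T)$, can yield regret logarithmic in $T$.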