Variational quantum policies for reinforcement learning

Variational quantum circuits have recently gained popularity as quantum machine learning models. While considerable effort has been invested to train them in supervised and unsupervised learning settings, relatively little attention has been given to their potential use in reinforcement learning. In this work, we leverage the understanding of quantum policy gradient algorithms in a number of ways. First, we investigate how to construct and train reinforcement learning policies based on variational quantum circuits. We propose several designs for quantum policies, provide their learning algorithms, and test their performance on classical benchmarking environments. Second, we show the existence of task environments with a provable separation in performance between quantum learning agents and any polynomial-time classical learner, conditioned on the widely-believed classical hardness of the discrete logarithm problem. We also consider more natural settings, in which we show an empirical quantum advantage of our quantum policies over standard neural-network policies. Our results constitute a first step towards establishing a practical near-term quantum advantage in a reinforcement learning setting. Additionally, we believe that some of our design choices for variational quantum policies may also be beneficial to other models based on variational quantum circuits, such as quantum classifiers and quantum regression models.
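
To make the idea of a policy built from a variational quantum circuit concrete, the following is a toy sketch only, not one of the designs proposed in this work. It assumes a single simulated qubit, a state feature encoded by one RY rotation, one trainable RY angle `theta`, and a softmax readout over the expectation values of Z and -Z with an inverse temperature `beta`. All function names and parameters here are hypothetical and chosen for illustration. The gradient of the log-policy, which a policy gradient method such as REINFORCE would use, is obtained with the parameter-shift rule.

```python
import numpy as np

# Pauli-Z observable for a single qubit.
Z = np.diag([1.0, -1.0])


def ry(angle):
    """Matrix of the single-qubit rotation RY(angle) = exp(-i * angle * Y / 2)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]])


def expectation_z(state_feature, theta):
    """<Z> after encoding the state feature and applying the trainable rotation."""
    psi = ry(theta) @ ry(state_feature) @ np.array([1.0, 0.0])
    return float(psi @ Z @ psi)


def policy(state_feature, theta, beta=1.0):
    """Softmax over the action observables (+Z, -Z); returns two probabilities."""
    z = expectation_z(state_feature, theta)
    logits = beta * np.array([z, -z])
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def grad_log_policy(state_feature, theta, action, beta=1.0):
    """d/dtheta of log pi(action | state), using the parameter-shift rule."""
    dz = 0.5 * (
        expectation_z(state_feature, theta + np.pi / 2)
        - expectation_z(state_feature, theta - np.pi / 2)
    )
    signs = np.array([1.0, -1.0])  # action 0 reads +Z, action 1 reads -Z
    p = policy(state_feature, theta, beta)
    return beta * dz * (signs[action] - p @ signs)


if __name__ == "__main__":
    probs = policy(state_feature=0.3, theta=0.4, beta=2.0)
    print("action probabilities:", probs)
    print("grad log pi(a=1):", grad_log_policy(0.3, 0.4, action=1, beta=2.0))
```

In a REINFORCE-style update under these assumptions, `theta` would be moved in the direction of `grad_log_policy(...)` scaled by the observed return. On hardware, the exact expectation values would be replaced by estimates from repeated measurements.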
