This work investigates formal policy synthesis for continuous-state stochastic dynamic systems under high-level specifications expressed in linear temporal logic. To learn an optimal policy that maximizes the satisfaction probability, we take the product of the dynamic system and the automaton translated from the specification, and solve an optimal planning problem on the resulting product system. Because this product system has a hybrid product state space that leads to reward sparsity, we introduce a generalized optimal backup order, the reverse of the topological order, to guide value backups and accelerate learning. We prove the optimality of using this generalized optimal backup order for the planning problem. Further, this paper presents an actor-critic reinforcement learning algorithm for the case in which a topological order applies. The algorithm leverages advanced mathematical techniques and enjoys hyperparameter self-tuning. We prove the optimality and convergence of the proposed reinforcement learning algorithm. We use neural networks to approximate the value and policy functions over the hybrid product state space. We further observe that assigning integers to automaton states imposes an artificial ordinal relationship on the value or policy functions approximated by neural networks. To break this ordinal relationship, we use an individual neural network for each automaton state's value (policy) function, a scheme we term modular learning. We conduct two experiments. First, to show the efficacy of our reinforcement learning algorithm, we compare it with baselines on a classic control task, CartPole. Second, we demonstrate the empirical performance of our formal policy synthesis framework on motion planning of a Dubins car with a temporal specification.
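To make the modular-learning idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes a PyTorch value-function approximator in which each automaton state gets its own small network (the class name `ModularValueNetwork`, the hidden size, and the routing by automaton state are illustrative assumptions), so that encoding automaton states as integers imposes no ordering on the learned values.

```python
import torch
import torch.nn as nn


class ModularValueNetwork(nn.Module):
    """Hypothetical sketch: one value network per automaton state,
    so no artificial ordinal relationship is induced by integer labels."""

    def __init__(self, automaton_states, obs_dim, hidden_dim=64):
        super().__init__()
        # A separate MLP head keyed by automaton state (illustrative layout).
        self.heads = nn.ModuleDict({
            str(q): nn.Sequential(
                nn.Linear(obs_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
            for q in automaton_states
        })

    def forward(self, obs, q):
        # Route the continuous observation to the head of automaton state q.
        return self.heads[str(q)](obs)


if __name__ == "__main__":
    # Toy usage on a hybrid product state (continuous observation, automaton state q).
    net = ModularValueNetwork(automaton_states=[0, 1, 2], obs_dim=4)
    obs = torch.randn(1, 4)
    print(net(obs, q=1))  # estimated value of the product state (obs, q=1)
```

The same per-automaton-state decomposition could be applied to the policy network; the sketch only shows the value function for brevity.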