Training Efficient Controllers via Analytic Policy Gradient

Control design for robotic systems is complex and often requires solving an optimization problem to follow a trajectory accurately. Online optimization approaches such as Model Predictive Control (MPC) have been shown to achieve strong tracking performance, but require high computing power. Conversely, learning-based offline optimization approaches, such as Reinforcement Learning (RL), allow fast and efficient execution on the robot but rarely match the accuracy of MPC in trajectory tracking tasks. In systems with limited compute, such as aerial vehicles, an accurate controller that is efficient at execution time is imperative. We propose an Analytic Policy Gradient (APG) method to tackle this problem. APG exploits the availability of differentiable simulators by training a controller offline with gradient descent on the tracking error. We address training instabilities that frequently occur with APG through curriculum learning, and we experiment on a widely used controls benchmark, the CartPole, and on two common aerial robots, a quadrotor and a fixed-wing drone. Our proposed method outperforms both model-based and model-free RL methods in terms of tracking error. At the same time, it achieves performance similar to MPC while requiring more than an order of magnitude less computation time. Our work provides insights into the potential of APG as a promising control method for robotics. To facilitate the exploration of APG, we open-source our code and make it available at https://github.com/lis-epfl/apg_trajectory_tracking.
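
The core idea in the abstract, gradient descent on the tracking error through a differentiable simulator, with a curriculum to stabilize training, can be illustrated with a toy sketch. This is not the authors' implementation: the 1D integrator dynamics, the single-gain linear policy, the sine reference, the learning rate, and the horizon schedule are all assumptions chosen for illustration, and a hand-derived sensitivity recursion stands in for automatic differentiation.

```python
import math

DT = 0.1  # simulation step in seconds (assumed value)


def reference(t_idx):
    """Reference trajectory to track: a sine wave (assumed for illustration)."""
    return math.sin(DT * t_idx)


def rollout_loss_and_grad(k, horizon):
    """Roll out x_{t+1} = x_t + DT * u_t with policy u_t = k * (r_t - x_t).

    Returns the mean squared tracking error over the horizon and its exact
    derivative with respect to the gain k, obtained by propagating the
    sensitivity s_t = dx_t/dk alongside the state.
    """
    x = reference(0)  # start on the reference
    s = 0.0           # the initial state does not depend on k
    loss, grad = 0.0, 0.0
    for t in range(horizon):
        err_in = reference(t) - x
        u = k * err_in
        du_dk = err_in - k * s       # product rule on u = k * (r_t - x_t)
        x = x + DT * u               # differentiable simulator step
        s = s + DT * du_dk           # chain rule through the same step
        e = x - reference(t + 1)     # tracking error after the step
        loss += e * e
        grad += 2.0 * e * s
    return loss / horizon, grad / horizon


def train(k_init=0.5, lr=1.0, stages=(20, 50, 100, 200), steps_per_stage=50):
    """Gradient descent on the tracking error with a simple curriculum:
    the rollout horizon grows from stage to stage (hypothetical schedule)."""
    k = k_init
    for horizon in stages:
        for _ in range(steps_per_stage):
            _, g = rollout_loss_and_grad(k, horizon)
            k -= lr * g
    return k


if __name__ == "__main__":
    k0 = 0.5
    loss_before, _ = rollout_loss_and_grad(k0, 200)
    k_trained = train(k_init=k0)
    loss_after, _ = rollout_loss_and_grad(k_trained, 200)
    print(f"gain: {k0} -> {k_trained:.3f}")
    print(f"tracking loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Starting with short rollouts keeps the early gradients well behaved before the horizon is lengthened. This is one simple reading of curriculum learning, and the schedule used in the paper may differ. With a neural network controller and a realistic simulator, the sensitivity recursion would typically be replaced by backpropagation through the simulated trajectory.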
