Planning-based reinforcement learning has shown strong performance in tasks with discrete and low-dimensional continuous action spaces. However, planning usually incurs significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose the Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and the current state as input and reconstructs long-horizon trajectories. At inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency that does not grow with the raw action dimensionality. On Adroit robotic hand manipulation tasks with high-dimensional continuous action spaces, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
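The abstract describes two pieces: a state-conditional VQ-VAE whose decoder acts as a dynamics model over trajectory segments, and an inference-time search over its discrete latent codes that trades off likelihood against predicted return. The PyTorch sketch below is a minimal illustration of that structure under assumed details, not TAP's actual architecture; the module and function names, the latent prior, the return head, and the scoring rule are all illustrative assumptions.

```python
# Minimal sketch of the two components described in the abstract, not the
# authors' implementation: a state-conditional VQ-VAE over trajectory
# segments, plus a planner that searches discrete latent codes for high
# return. Module/function names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class LatentDynamicsVQVAE(nn.Module):
    """Encodes trajectory segments into discrete codes; the decoder acts as a
    dynamics model mapping (current state, latent code) -> trajectory."""

    def __init__(self, state_dim, traj_dim, num_codes=512, code_dim=64, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + traj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim),
        )
        self.codebook = nn.Embedding(num_codes, code_dim)
        # The decoder reconstructs the flattened trajectory segment
        # (future states, actions, rewards) from the state plus quantized code.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, traj_dim),
        )

    def quantize(self, z):
        # Nearest-neighbour codebook lookup; the code index is the
        # low-dimensional discrete "latent action".
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        return self.codebook(idx), idx

    def forward(self, state, traj):
        z = self.encoder(torch.cat([state, traj], dim=-1))
        z_q, idx = self.quantize(z)
        recon = self.decoder(torch.cat([state, z_q], dim=-1))
        return recon, idx


@torch.no_grad()
def plan(model, prior_logits, state, return_head, num_samples=64):
    """Sample candidate latent codes from a learned prior, decode them with
    the current state, and keep the trajectory that balances prior
    likelihood with predicted return. `prior_logits` (per-code logits) and
    `return_head` (trajectory -> scalar return) are assumed components."""
    probs = prior_logits.softmax(dim=-1)
    idx = torch.multinomial(probs, num_samples, replacement=True)
    z_q = model.codebook(idx)                         # (num_samples, code_dim)
    states = state.expand(num_samples, -1)            # broadcast the start state
    trajs = model.decoder(torch.cat([states, z_q], dim=-1))
    # Naive combination of log-likelihood under the prior and predicted return.
    score = probs.log()[idx] + return_head(trajs).squeeze(-1)
    return trajs[score.argmax()]
```

Note the design point this sketch is meant to surface: the search happens over a fixed-size discrete codebook rather than the raw continuous action space, which is why decision latency can stay flat as the raw action dimensionality grows.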