While planning-based sequence modelling methods have shown great potential in continuous control, scaling them to high-dimensional state-action sequences remains an open challenge due to the high computational cost and innate difficulty of planning in high-dimensional spaces. We propose the Trajectory Autoencoding Planner (TAP), a planning-based sequence modelling RL method that scales to high state-action dimensionalities. Using a state-conditional Vector-Quantized Variational Autoencoder (VQ-VAE), TAP models the conditional distribution of trajectories given the current state. When deployed as an RL agent, TAP avoids planning step by step in a high-dimensional continuous action space and instead searches for optimal latent code sequences by beam search. Unlike the Trajectory Transformer (TT), whose planning complexity is $O(D^3)$ in the state-action dimensionality $D$, TAP enjoys constant $O(C)$ planning complexity with respect to $D$. Our empirical evaluation also shows that TAP's performance grows increasingly strong as dimensionality increases. On Adroit robotic hand manipulation tasks, which have high state and action dimensionality, TAP surpasses existing model-based methods, including TT, by a large margin, and also beats strong model-free actor-critic baselines.
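The abstract's key mechanism is searching over discrete latent codes rather than raw continuous actions. A minimal sketch of beam search over a discrete latent vocabulary, with a hypothetical scalar score function standing in for TAP's learned value estimate (all names and the toy score are illustrative assumptions, not the paper's implementation):

```python
def beam_search_latents(score_fn, vocab_size, seq_len, beam_width):
    """Beam search over sequences of discrete latent codes.

    score_fn(seq) returns a scalar estimate for a partial sequence of
    latent codes (higher is better); in TAP this role is played by a
    learned value/return estimate, here it is a stand-in.
    """
    beams = [((), 0.0)]  # (sequence of codes, score)
    for _ in range(seq_len):
        candidates = []
        for seq, _ in beams:
            # Expand each beam with every code in the latent vocabulary.
            for code in range(vocab_size):
                new_seq = seq + (code,)
                candidates.append((new_seq, score_fn(new_seq)))
        # Keep only the top-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy score: prefer codes matching a fixed target pattern.
target = [3, 1, 2, 0]
def toy_score(seq):
    return -sum(abs(c - t) for c, t in zip(seq, target))

best_seq, best_score = beam_search_latents(
    toy_score, vocab_size=4, seq_len=4, beam_width=2)
print(best_seq)  # -> (3, 1, 2, 0)
```

Because each search step enumerates a fixed-size code vocabulary, the per-step cost is independent of the underlying state-action dimensionality, which is the intuition behind the constant $O(C)$ complexity claim.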