We study generalizable policy learning from demonstrations for complex low-level control tasks (e.g., contact-rich object manipulation). We propose an imitation learning method that incorporates the idea of temporal abstraction and the planning capabilities of Hierarchical RL (HRL) in a novel and effective manner. As a step toward decision foundation models, our design can utilize scalable, albeit highly sub-optimal, demonstrations. Specifically, we find that certain short subsequences of the demos, i.e., the chain-of-thought (CoT), reflect their hierarchical structure by marking the completion of subgoals in the tasks. Our model learns to dynamically predict the entire CoT as coherent and structured long-term action guidance, and it consistently outperforms typical two-stage subgoal-conditioned policies. Moreover, such CoT facilitates generalizable policy learning, as it exemplifies the decision patterns shared among demos (even those with heavy noise and randomness). Our method, Chain-of-Thought Predictive Control (CoTPC), significantly outperforms existing methods on challenging low-level manipulation tasks learned from scalable yet highly sub-optimal demos.