Sim-to-real transfer remains a fundamental challenge in robot manipulation due to the entanglement of perception and control in end-to-end learning. We present a decoupled framework that learns each component where it is most reliable: control policies are trained in simulation with privileged state to master spatial layouts and manipulation dynamics, while perception is adapted only at deployment to bridge real observations to the frozen control policy. Our key insight is that control strategies and action patterns are universal across environments and can be learned in simulation through systematic randomization, while perception is inherently domain-specific and must be learned where visual observations are authentic. Unlike existing end-to-end approaches that require extensive real-world data, our method achieves strong performance with only 10-20 real demonstrations by reducing the complex sim-to-real problem to a structured perception alignment task. We validate our approach on tabletop manipulation tasks, demonstrating superior data efficiency and out-of-distribution generalization compared to end-to-end baselines. The learned policies successfully handle object positions and scales beyond the training distribution, confirming that decoupling perception from control fundamentally improves sim-to-real transfer.
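To make the decoupling concrete, the following is a minimal toy sketch, not the method's actual implementation. It assumes a 2D privileged state, a proportional controller standing in for the simulation-trained policy, an affine real-world sensor, an affine perception module, and demonstrations whose actions match what the frozen policy would output on the true state. All names and constants (`frozen_policy`, `real_observation`, `perceive`, `align_perception`, `GOAL`, `GAIN`) are hypothetical. The sketch only illustrates the idea that the control policy stays fixed while a small perception module is fitted from roughly 15 real demonstrations by matching the demonstrated actions.

```python
import random

GOAL = (0.0, 0.0)  # hypothetical target position (assumption)
GAIN = 2.0         # hypothetical controller gain (assumption)


def frozen_policy(state):
    """Stand-in for a policy trained in simulation on privileged state.

    A proportional controller: action = GAIN * (GOAL - state).
    It has no trainable parameters, which mirrors the policy being frozen.
    """
    return [GAIN * (g - s) for g, s in zip(GOAL, state)]


def real_observation(state):
    """Hypothetical real sensor: an unknown affine distortion of the state."""
    x, y = state
    return [1.2 * x + 0.1 * y + 0.3, -0.2 * x + 0.9 * y - 0.1]


def perceive(params, obs):
    """Learnable perception module: state_hat = M @ obs + c."""
    M, c = params
    return [M[i][0] * obs[0] + M[i][1] * obs[1] + c[i] for i in range(2)]


def collect_demos(n, rng):
    """Return n real demonstrations as (observation, action) pairs.

    Assumption: the demonstrator acts as the frozen policy would on the
    true state, which is never shown to the perception module.
    """
    demos = []
    for _ in range(n):
        state = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
        demos.append((real_observation(state), frozen_policy(state)))
    return demos


def align_perception(demos, steps=3000, lr=0.05):
    """Fit only the perception parameters by gradient descent on the
    mean squared action error; the policy itself is never updated."""
    M = [[1.0, 0.0], [0.0, 1.0]]
    c = [0.0, 0.0]
    n = len(demos)
    for _ in range(steps):
        grad_M = [[0.0, 0.0], [0.0, 0.0]]
        grad_c = [0.0, 0.0]
        for obs, act in demos:
            pred = frozen_policy(perceive((M, c), obs))
            for i in range(2):
                # Gradient of (pred[i] - act[i])**2 w.r.t. state_hat[i],
                # using pred[i] = GAIN * (GOAL[i] - state_hat[i]).
                g = -2.0 * GAIN * (pred[i] - act[i]) / n
                grad_M[i][0] += g * obs[0]
                grad_M[i][1] += g * obs[1]
                grad_c[i] += g
        for i in range(2):
            M[i][0] -= lr * grad_M[i][0]
            M[i][1] -= lr * grad_M[i][1]
            c[i] -= lr * grad_c[i]
    return M, c


if __name__ == "__main__":
    rng = random.Random(0)
    params = align_perception(collect_demos(15, rng))
    test_state = [0.9, -0.8]  # outside the demonstrated range
    print(frozen_policy(perceive(params, real_observation(test_state))))
    print(frozen_policy(test_state))
```

In this toy setting the perception model can represent the inverse sensor map exactly, so the aligned pipeline also reproduces the policy's actions for states outside the demonstrated range. Real visual perception is far less benign, so this property should not be read as evidence for the reported results.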