Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish modular composition as a data-efficient and scalable path toward high-performance bimanual manipulation that leverages public single-arm data.