Recent advances in Vision-Language-Action (VLA) and world-model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine-grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data generation framework for articulated-object manipulation that is instantiated from a single real scan, a single demonstration, and a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes synchronized multi-view RGB sequences temporally aligned with action commands and with state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results show that fine-tuning VLA policies on AOMGen data increases the success rate from 0% to 88.7%, and the fine-tuned policies are further evaluated on unseen objects and scene layouts.
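To make the abstract's data description concrete, the sketch below shows what a single generated training sample could look like: synchronized multi-view RGB frames aligned with an action command, joint and contact annotations, and the randomization metadata (camera viewpoint, object style, object pose) used to diversify one execution. This is a minimal illustration, not the paper's actual data format; all field names and shapes are hypothetical assumptions.

```python
# Illustrative sketch (hypothetical, not from the paper) of one timestep of an
# AOMGen-style training sample, based only on the data types named in the abstract.
from dataclasses import dataclass, field
from typing import Dict
import numpy as np


@dataclass
class AOMGenSample:
    """One timestep of a generated articulated-manipulation trajectory."""
    # Multi-view RGB frames keyed by camera name, each of shape (H, W, 3).
    rgb: Dict[str, np.ndarray]
    # Action command for this timestep (e.g. a 7-DoF end-effector delta + gripper).
    action: np.ndarray
    # Articulated-object joint states (e.g. drawer extension, hinge angle).
    joint_state: Dict[str, float]
    # Contact flags between the gripper and object parts.
    contacts: Dict[str, bool]
    # Randomization metadata used to expand one execution into a diverse corpus.
    camera_pose_id: int = 0
    object_style_id: int = 0
    object_pose: np.ndarray = field(default_factory=lambda: np.eye(4))


def make_dummy_sample() -> AOMGenSample:
    """Build a placeholder sample to show the expected shapes."""
    return AOMGenSample(
        rgb={"front": np.zeros((480, 640, 3), dtype=np.uint8),
             "wrist": np.zeros((480, 640, 3), dtype=np.uint8)},
        action=np.zeros(8, dtype=np.float32),
        joint_state={"drawer_joint": 0.12},
        contacts={"gripper/handle": True},
    )


if __name__ == "__main__":
    sample = make_dummy_sample()
    print(sample.rgb["front"].shape, sample.action.shape, sample.joint_state)
```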