While Vision-Language-Action (VLA) models have demonstrated impressive capabilities in robotic manipulation, their performance in complex reasoning and long-horizon task planning is limited by data scarcity and model capacity. To address this, we introduce ManiAgent, an agentic architecture for general manipulation tasks that produces end-to-end outputs from task descriptions and environmental inputs to robotic manipulation actions. In this framework, multiple agents collaborate through inter-agent communication to perform environmental perception, sub-task decomposition, and action generation, enabling efficient handling of complex manipulation scenarios. Evaluations show ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks, and it enables efficient data collection that yields VLA models with performance comparable to those trained on human-annotated datasets. The project webpage is available at https://yi-yang929.github.io/ManiAgent/.
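The following is a minimal, hypothetical sketch of the kind of agentic pipeline described above: a perception agent, a planning agent, and an action agent pass messages in sequence, mapping a task description and an observation to low-level action commands. All class names, message fields, and the stubbed decomposition logic are illustrative assumptions for exposition, not the authors' implementation.

```python
from typing import Dict, List


class Message:
    """Simple container for inter-agent communication."""

    def __init__(self, sender: str, content: dict):
        self.sender = sender
        self.content = content


class PerceptionAgent:
    """Summarizes the environment (here: a stubbed object list) into a message."""

    def run(self, observation: Dict) -> Message:
        objects = observation.get("objects", [])
        return Message(sender="perception", content={"objects": objects})


class PlannerAgent:
    """Decomposes the task description into an ordered list of sub-tasks."""

    def run(self, task: str, perception: Message) -> Message:
        objects = perception.content["objects"]
        # Naive decomposition: one pick-and-place sub-task per detected object.
        subtasks = [f"pick {obj} and place it at the target" for obj in objects]
        return Message(sender="planner", content={"task": task, "subtasks": subtasks})


class ActionAgent:
    """Turns each sub-task into a (stubbed) low-level action command."""

    def run(self, plan: Message) -> List[str]:
        return [f"EXECUTE[{sub}]" for sub in plan.content["subtasks"]]


def mani_pipeline(task: str, observation: Dict) -> List[str]:
    """End-to-end flow: task description + observation -> action commands."""
    perception = PerceptionAgent().run(observation)
    plan = PlannerAgent().run(task, perception)
    return ActionAgent().run(plan)


if __name__ == "__main__":
    obs = {"objects": ["red block", "blue cup"]}
    for action in mani_pipeline("tidy up the table", obs):
        print(action)
```

In an actual system each agent would wrap a vision-language or action model rather than the string-based stubs shown here; the sketch only illustrates how perception, sub-task decomposition, and action generation can be chained through explicit message passing.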