Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Vision-Language-Action (VLA) models are developing rapidly and show promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents two critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it is particularly valuable to fully leverage well-pretrained VLA model weights during scaling. (2) Real-time control requires carefully balancing model capacity against computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feedforward layers with sparsely activated MoE layers. AdaMoE decouples expert selection from expert weighting through an independent scale adapter that works alongside the traditional router. Experts are thus selected based on task relevance while contributing with independently controlled weights, which allows collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize: through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
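The decoupling idea can be illustrated with a minimal sketch; it is not the authors' implementation. The abstract does not say how the scale adapter's output is turned into weights, so the sketch assumes a sigmoid over a separate linear adapter. All names (`decoupled_moe`, `router_w`, `scale_w`) are hypothetical, and the experts are randomly initialized for illustration rather than inherited from a pretrained dense model.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn(expert, x):
    """Two-layer feedforward expert with a ReLU in between."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def decoupled_moe(x, experts, router_w, scale_w, top_k=2):
    """Sparse MoE layer where selection and weighting use separate parameters."""
    # Selection: the router alone decides WHICH experts run (top-k by logit).
    router_logits = matvec(router_w, x)
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    selected = ranked[:top_k]

    # Weighting: a separate scale adapter decides HOW MUCH each selected expert
    # contributes. ASSUMPTION: sigmoid of the adapter logit; the paper may
    # combine router and adapter outputs differently.
    scale_logits = matvec(scale_w, x)
    weights = {i: 1.0 / (1.0 + math.exp(-scale_logits[i])) for i in selected}

    # Only the selected experts are evaluated, so compute stays sparse.
    out = [0.0] * len(x)
    for i in selected:
        y = ffn(experts[i], x)
        out = [o + weights[i] * yi for o, yi in zip(out, y)]
    return out, selected, weights


def rand_matrix(rows, cols):
    """Random matrix with entries in [-1, 1], used only for this demo."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim, hidden, n_experts = 4, 8, 4
experts = [(rand_matrix(hidden, dim), rand_matrix(dim, hidden)) for _ in range(n_experts)]
router_w = rand_matrix(n_experts, dim)
scale_w = rand_matrix(n_experts, dim)
x = [random.uniform(-1.0, 1.0) for _ in range(dim)]

out, selected, weights = decoupled_moe(x, experts, router_w, scale_w)
print("selected experts:", selected)
print("contribution weights:", weights)
```

Because the weights in this sketch are not normalized against one another, several selected experts can each contribute strongly, which is one way to read "collaborative expert utilization rather than winner-takes-all dynamics". Swapping in a different `scale_w` changes the weights but leaves the selected set untouched.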

