We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning on the visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts the collaborator's hand pose and target cues, and (iii) action-space post-processing in which the policy predicts compact deltas (position/rotation) and PCA-reduced finger joints that are then mapped to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we show that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; the auxiliary intent head helps, FiLM conditioning gives mixed results, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on a single RTX 4090) composes "pick-up" and "pass" into a long-horizon behavior. We identify overfitting to specific demonstrators ("trainer overfitting") as the key limitation.
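To make the action-space post-processing concrete, the following is a minimal NumPy sketch, not the authors' implementation, of fitting a PCA basis on logged finger-joint targets, having the policy predict only the low-dimensional codes, and expanding those codes back to full joint commands. The joint dimensionality (16), the function names, and the synthetic data are illustrative assumptions.

```python
# Minimal sketch of PCA-based finger-joint compression for action post-processing.
# Assumes a log of demonstrated joint targets of shape (N, 16); dimensions,
# names, and the synthetic data below are illustrative, not from the paper.
import numpy as np

def fit_pca(joints: np.ndarray, n_components: int = 4):
    """Fit a PCA basis on demonstrated finger-joint vectors (N x D)."""
    mean = joints.mean(axis=0)
    centered = joints - mean
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return mean, vt[:n_components], explained[:n_components]

def compress(joints: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project full joint vectors onto the low-dimensional PCA space."""
    return (joints - mean) @ components.T

def reconstruct(codes: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Map predicted PCA codes back to full joint commands."""
    return codes @ components + mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_joints = rng.normal(size=(1000, 16))     # stand-in for logged hand-joint targets
    mean, comps, var_ratio = fit_pca(demo_joints, n_components=4)
    codes = compress(demo_joints, mean, comps)    # the policy would predict these 4 values
    full_cmd = reconstruct(codes, mean, comps)    # post-processing expands them to 16 joints
    print("explained variance of 4 PCs:", var_ratio.sum())
```

On real demonstration data the leading components are expected to capture most of the variance (the paper reports ~96% with four components); the random data above is only a placeholder to keep the sketch runnable.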