Human-robot collaboration towards a shared goal requires robots to understand human actions and interactions with the surrounding environment. This paper focuses on human-robot interaction (HRI) through dialogue, in which the robot confirms its actions and generates action steps using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps, aligned with robot action confirmations, from a single clip showing a task composed of multiple micro-steps. Although the actions of a long-horizon task depend on one another throughout an entire video, current approaches mainly operate at the clip level and do not leverage long-context information. This paper proposes a long-context Q-former that incorporates left and right context dependencies across the full video. Furthermore, it proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the over-abstraction of textual information by the Q-former. Experiments on the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in action-planning performance. We further demonstrate that the long-context Q-former, integrated with VideoLLaMA3, improves both confirmation generation and action planning.
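The sketch below illustrates one plausible reading of the long-context Q-former idea: learnable queries cross-attend to visual features of the current clip concatenated with its left and right neighbor clips, so that the resulting tokens carry cross-clip context before being passed to the LLM decoder. This is a minimal illustration, not the authors' implementation; the class name, dimensions, and use of nn.TransformerDecoder are assumptions.

```python
# Minimal sketch of a long-context Q-former (illustrative assumptions throughout).
import torch
import torch.nn as nn


class LongContextQFormer(nn.Module):
    def __init__(self, vis_dim=1024, hidden_dim=768, num_queries=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens, as in BLIP-2-style Q-formers.
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, left_feats, cur_feats, right_feats):
        # Each input: (batch, n_tokens, vis_dim) visual features of one clip.
        # Concatenate left context, current clip, and right context so the
        # queries can attend across clip boundaries in the full video.
        context = torch.cat([left_feats, cur_feats, right_feats], dim=1)
        memory = self.vis_proj(context)
        tgt = self.queries.unsqueeze(0).expand(cur_feats.size(0), -1, -1)
        # Queries self-attend and cross-attend over the long-context memory;
        # the output tokens would then be fed to the LLM decoder as soft prompts.
        return self.decoder(tgt, memory)


if __name__ == "__main__":
    qformer = LongContextQFormer()
    b, t, d = 2, 16, 1024
    left, cur, right = (torch.randn(b, t, d) for _ in range(3))
    print(qformer(left, cur, right).shape)  # torch.Size([2, 32, 768])
```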