The growing adoption of augmented and virtual reality (AR and VR) technologies in industrial training and on-the-job assistance has created new opportunities for intelligent, context-aware support systems. As workers perform complex tasks guided by AR and VR, these systems capture rich streams of multimodal data, including gaze, hand actions, and task progression, that can reveal user intent and task state in real time. However, leveraging this information effectively remains a major challenge. In this work, we present a context-aware large language model (LLM) assistant that integrates diverse data modalities, such as hand actions, task steps, and dialogue history, into a unified framework for real-time question answering. To systematically study how context influences performance, we introduce an incremental prompting framework in which each model version receives progressively richer contextual inputs. Using the HoloAssist dataset, which records AR-guided task executions, we evaluate how each modality contributes to the assistant's effectiveness. Our experiments show that incorporating multimodal context significantly improves the accuracy and relevance of responses. These findings highlight the potential of LLM-driven multimodal integration to enable adaptive, intuitive assistance for AR- and VR-based industrial training and assistance.
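To make the incremental-prompting idea concrete, the sketch below assembles progressively richer prompts from the modalities named above (hand actions, task steps, dialogue history). It is a minimal illustration only: the names `ContextBundle` and `build_prompt`, the context levels, and the prompt layout are assumptions for exposition, not the paper's actual implementation or prompt design.

```python
# Minimal sketch of incremental prompting: each level adds one more
# context modality before the worker's question is sent to the LLM.
# All names and the prompt format here are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextBundle:
    """Multimodal context captured from the AR/VR session (hypothetical container)."""
    hand_actions: List[str] = field(default_factory=list)     # e.g. "grasp screwdriver"
    task_steps: List[str] = field(default_factory=list)       # completed / current steps
    dialogue_history: List[str] = field(default_factory=list)  # prior user/assistant turns


def build_prompt(question: str, ctx: ContextBundle, level: int) -> str:
    """Assemble a prompt whose context grows richer with `level`.

    level 0: question only
    level 1: + recent hand actions
    level 2: + task-step progression
    level 3: + dialogue history
    """
    parts = []
    if level >= 1 and ctx.hand_actions:
        parts.append("Recent hand actions:\n" + "\n".join(ctx.hand_actions[-5:]))
    if level >= 2 and ctx.task_steps:
        parts.append("Task progression:\n" + "\n".join(ctx.task_steps))
    if level >= 3 and ctx.dialogue_history:
        parts.append("Dialogue so far:\n" + "\n".join(ctx.dialogue_history[-6:]))
    parts.append("Worker question: " + question)
    return "\n\n".join(parts)


# Example: the same question paired with progressively richer context.
ctx = ContextBundle(
    hand_actions=["pick up torx bit", "attach bit to driver"],
    task_steps=["Step 1: open battery cover (done)", "Step 2: remove screws (current)"],
    dialogue_history=["User: which screw first?", "Assistant: start with the top-left one."],
)
for lvl in range(4):
    print(f"--- context level {lvl} ---\n{build_prompt('What do I do next?', ctx, lvl)}\n")
```

Comparing the assistant's answers across such context levels is one simple way to attribute performance gains to individual modalities, in the spirit of the evaluation described above.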