A long-standing goal of intelligent assistants such as AR glasses and robots is to assist users in affordance-centric real-world scenarios, such as "How can I run the microwave for 1 minute?". However, there is still neither a clear task definition nor a suitable benchmark for this setting. In this paper, we define a new task called Affordance-centric Question-driven Task Completion, in which an AI assistant learns from instructional videos and scripts to guide the user step by step. To support the task, we construct AssistQ, a new dataset comprising 531 question-answer samples derived from 100 newly filmed first-person videos. Each question should be answered with multi-step guidance inferred from visual details (e.g., button positions) and textual details (e.g., actions such as press/turn). To address this unique task, we develop a Question-to-Actions (Q2A) model that significantly outperforms several baseline methods while still leaving large room for improvement. We expect our task and dataset to advance the development of egocentric AI assistants. Our project page is available at: https://showlab.github.io/assistq