Recent progress in robotics and embodied AI is largely driven by Large Multimodal Models (LMMs). However, a key challenge remains underexplored: how can we advance LMMs to discover tasks that assist humans in open-future scenarios, where human intentions are highly concurrent and dynamic? In this work, we formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. Additionally, we propose the Collaborative Multi-Agent Search Tree (CMAST) framework, which decomposes complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module. In our experiments, CMAST achieves the best performance on HOTD-Bench, significantly surpassing existing LMMs. It also integrates well with existing LMMs, consistently improving their performance.