Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level agents in indoor scenarios, while UAV agents remain largely unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring first-person-view urban drone imagery. We also create a virtual image-text-pose alignment dataset, CyberAgent Ego500k, to facilitate pre-training of the aerospace embodied world model. For the first time, we clearly define five downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision-making, and construct the corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. In addition, we develop SkyAgentEval, a set of GPT-4-based evaluation metrics for the downstream tasks, to assess results comprehensively, flexibly, and objectively, revealing the potential and limitations of 2D/3D vision-language models on UAV-agent tasks. Finally, we integrate more than ten 2D/3D vision-language models, the two pre-training datasets, the five fine-tuning datasets, more than ten evaluation metrics, and a simulator into a benchmark suite, AeroVerse, which will be released to the community to promote the exploration and development of aerospace embodied intelligence.