Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents at scale requires diverse, high-fidelity urban environments, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100K+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to that of manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit power-law scaling and strong generalization, improving success rates by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared with prior methods, and accomplishing a 300 m real-world mission with only two interventions.