Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories in Embodied AI. We propose ProcTHOR, a framework for the procedural generation of Embodied AI environments. ProcTHOR enables us to sample arbitrarily large datasets of diverse, interactive, customizable, and performant virtual environments to train and evaluate embodied agents across navigation, interaction, and manipulation tasks. We demonstrate the power and potential of ProcTHOR via a sample of 10,000 generated houses and a simple neural model. Models trained on ProcTHOR using only RGB images, with no explicit mapping and no human task supervision, produce state-of-the-art results across 6 Embodied AI benchmarks for navigation, rearrangement, and arm manipulation, including the currently running Habitat 2022, AI2-THOR Rearrangement 2022, and RoboTHOR challenges. We also demonstrate strong zero-shot results on these benchmarks, via pre-training on ProcTHOR with no fine-tuning on the downstream benchmark, often beating previous state-of-the-art systems that access the downstream training data.