Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. However, benchmarks in robot learning often test generalization through inference on tasks in unobserved environments. In an observed environment, locating an object reduces to choosing among all object proposals in the environment, which may number in the hundreds of thousands. Building on this intuition, and using only a generic vision-language scoring model with minor modifications for 3D encoding and for operating in an embodied environment, we demonstrate an absolute performance gain of 9.84% on remote object grounding over state-of-the-art models on REVERIE and of 5.04% on FAO. When allowed to pre-explore an environment, we also exceed the previous state-of-the-art pre-exploration method on REVERIE. Additionally, we demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach. Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.
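The framing of grounding as selection among stored object proposals can be illustrated with a minimal sketch. It is not the paper's model: the `Proposal` class, the `encode`, `cosine`, and `ground` functions, and the use of cosine similarity over an appearance feature concatenated with 3D coordinates are all hypothetical stand-ins for the vision-language scoring model described above.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Proposal:
    """Hypothetical object proposal stored in the robot's map."""
    label: str
    feature: Sequence[float]  # placeholder appearance embedding
    xyz: Sequence[float]      # 3D position in the map frame


def encode(p: Proposal) -> List[float]:
    # Assumption: 3D coordinates are simply appended to the feature vector.
    return list(p.feature) + list(p.xyz)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity, returning 0.0 when either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ground(query: Sequence[float], proposals: List[Proposal]) -> Proposal:
    # Score every proposal against the query and return the best match.
    return max(proposals, key=lambda p: cosine(query, encode(p)))


proposals = [
    Proposal("mug", [1.0, 0.0], [0.0, 0.0, 1.0]),
    Proposal("chair", [0.0, 1.0], [2.0, 0.0, 0.0]),
]
query = [1.0, 0.0, 0.0, 0.0, 1.0]  # made-up instruction embedding
best = ground(query, proposals)
```

With hundreds of thousands of proposals, an exhaustive scan like this would likely be batched or pruned; the sketch only shows the selection step itself.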