The world is filled with articulated objects whose use is difficult to determine from vision alone; for example, a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door, then pulling it if that doesn't work. We enable these capabilities in autonomous agents by proposing "Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR), a probabilistic generative framework that simultaneously generates a distribution of hypotheses about how objects articulate given input observations, updates its certainty over these hypotheses over time, and infers plausible actions for exploration and goal-conditioned manipulation. We compare our model against existing work on manipulating objects after a handful of exploration actions, using the PartNet-Mobility dataset. We further propose a novel PuzzleBoxes benchmark that contains locked boxes requiring multiple steps to solve. We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework, despite using zero training data. Finally, we further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
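To make the hypothesize-simulate-act-update loop concrete, the following is a minimal, self-contained sketch of that cycle on the door example from the abstract. All of it, including the hypothesis encoding, the toy forward model, and the likelihood, is an illustrative assumption for exposition, not the paper's actual implementation.

```python
# Toy illustration of the H-SAUR loop (Hypothesize, Simulate, Act, Update, Repeat)
# on a door whose opening direction is unknown. All names and models here are
# hypothetical placeholders, not the paper's implementation.

def hypothesize():
    """Start with a uniform belief over how the door articulates."""
    return {"opens_outward": 0.5, "opens_inward": 0.5}

def simulate(hypothesis, action):
    """Toy forward model: predicted door motion if the hypothesis were true."""
    if hypothesis == "opens_outward":
        return 1.0 if action == "push" else 0.0
    return 1.0 if action == "pull" else 0.0  # opens_inward

def select_action(belief, actions=("push", "pull")):
    """Pick the action with the highest expected motion under the current belief."""
    return max(actions, key=lambda a: sum(p * simulate(h, a) for h, p in belief.items()))

def update(belief, action, observed_motion, noise=0.05):
    """Bayes-style reweighting: hypotheses that predicted the outcome gain weight."""
    likelihood = {h: (1 - noise) if simulate(h, action) == observed_motion else noise
                  for h in belief}
    posterior = {h: p * likelihood[h] for h, p in belief.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

# Trial-and-error loop against a door that (unknown to the agent) opens inward.
true_door = "opens_inward"
belief = hypothesize()
for step in range(3):
    action = select_action(belief)
    observed = simulate(true_door, action)  # stand-in for executing in the real world
    belief = update(belief, action, observed)
    print(step, action, observed, belief)
```

Running this sketch mirrors the human strategy described above: the agent first pushes, observes no motion, shifts its belief toward the inward-opening hypothesis, and then pulls.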