Classical planning systems have made great advances in utilizing rule-based human knowledge to compute accurate plans for service robots, but they face challenges stemming from their strong assumptions of perfect perception and action execution. To tackle these challenges, one solution is to connect the symbolic states and actions generated by classical planners to the robot's sensory observations, thus closing the perception-action loop. This research proposes a visually grounded planning framework, named TPVQA, which leverages Vision-Language Models (VLMs) to detect action failures and verify action affordances, toward enabling successful plan execution. Results from quantitative experiments show that TPVQA surpasses competitive baselines from previous studies in task completion rate.