Our work examines how large language models can be used for robotic planning and sampling, specifically in the context of automated photographic documentation. We illustrate how recent advances in general-purpose language models (LMs) and vision-language models (VLMs) can be leveraged to build a photo-taking robot with a high degree of semantic awareness. Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that a photographer would be expected to capture at that event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. Human evaluators consistently rate the photo portfolios generated by our method as more appropriate to the event than those generated by existing methods.
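The pipeline above can be sketched as follows. This is a minimal, self-contained illustration, not the actual system: `lm_shot_list` is a hypothetical stub standing in for the LM call, and `embed` is a toy character-frequency embedding standing in for VLM image-text embeddings (a real system would score frames with a vision-language model). Only the selection logic, matching each generated description to its best frame by cosine similarity, reflects the method described.

```python
# Sketch of the LM -> shot list -> VLM matching pipeline.
# All functions below are hypothetical stand-ins, not the paper's API.

def lm_shot_list(event: str) -> list[str]:
    """Stub for the LM: return expected photo descriptions for an event."""
    return [f"{event}: guests arriving", f"{event}: cutting the cake"]

def embed(text: str) -> list[float]:
    """Toy embedding (character frequencies); a real system would use
    VLM image/text embeddings here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_photos(event: str, frames: list[str]) -> dict[str, str]:
    """For each LM-generated description, keep the best-matching frame
    from the video stream (frames represented here as captions)."""
    portfolio = {}
    for desc in lm_shot_list(event):
        portfolio[desc] = max(frames, key=lambda f: cosine(embed(desc), embed(f)))
    return portfolio
```

In the real system the candidate "frames" are images from the robot's video stream scored directly by the VLM; the dictionary structure, one best frame per generated description, is the portfolio the method returns.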