Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions. However, existing instruction generators have not been comprehensively evaluated, and the automatic evaluation metrics used to develop them have not been validated. Using human wayfinders, we show that these generators perform on par with or only slightly better than a template-based generator and far worse than human instructors. Furthermore, we discover that BLEU, ROUGE, METEOR and CIDEr are ineffective for evaluating grounded navigation instructions. To improve instruction evaluation, we propose an instruction-trajectory compatibility model that operates without reference instructions. Our model shows the highest correlation with human wayfinding outcomes when scoring individual instructions. For ranking instruction generation systems, if reference instructions are available we recommend using SPICE.