Procedural mistake detection (PMD) is the challenging problem of classifying whether a human user, observed through egocentric video, has successfully executed a task specified by a procedural text. Despite significant recent effort, machine performance in the wild remains far from viable, and the reasoning underlying this performance is opaque. We therefore extend PMD to require generating visual self-dialog rationales that inform decisions. Given the mature image-understanding capabilities of recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. Because this reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that off-the-shelf VLMs struggle, but that incorporating these metrics into common inference and fine-tuning methods improves their accuracy, coherence, and efficiency, with some trade-offs. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
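To make the NLI-based coherence idea concrete, the sketch below shows one plausible way to aggregate NLI predictions over a generated self-dialog into a scalar coherence score. This is an illustration, not the paper's actual metric: the `nli_probs` stub stands in for a real NLI model (e.g. one loaded via HuggingFace `transformers`), and the adjacent-step pairing and the entailment-minus-contradiction aggregation are assumptions made for this example.

```python
# Hedged sketch: scoring the coherence of a generated self-dialog rationale
# with an NLI model. Adjacent rationale steps are treated as
# (premise, hypothesis) pairs; a coherent dialog should have steps that
# entail (or at least not contradict) their successors.

def nli_probs(premise: str, hypothesis: str) -> dict:
    """Stub for an NLI model returning P(entailment), P(neutral),
    P(contradiction). A toy word-overlap heuristic for illustration only;
    a real system would call a trained NLI model here."""
    prem_words = set(premise.lower().split())
    hyp_words = hypothesis.lower().split()
    overlap = len(prem_words & set(hyp_words)) / max(len(hyp_words), 1)
    return {"entailment": overlap,
            "neutral": 1.0 - overlap,
            "contradiction": 0.0}

def coherence(steps: list) -> float:
    """Mean (entailment - contradiction) over adjacent rationale steps;
    higher values indicate a self-dialog that hangs together better."""
    if len(steps) < 2:
        return 1.0
    scores = []
    for prem, hyp in zip(steps, steps[1:]):
        p = nli_probs(prem, hyp)
        scores.append(p["entailment"] - p["contradiction"])
    return sum(scores) / len(scores)

steps = [
    "The user has placed the filter in the coffee maker.",
    "The filter is in the coffee maker, so grounds can be added.",
]
print(round(coherence(steps), 2))
```

A second metric could analogously penalize contradictions between each rationale step and the final success/failure verdict; both reduce to aggregating per-pair NLI scores, which is what makes the evaluation fully automatic.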