In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus is on contextual understanding and the ability to provide accurate venue recommendations that respect user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances paired with both correct system responses and modified responses containing injected failures. We apply input-output, chain-of-thought, self-consistency, and multi-agent prompting techniques to 13 reasoning and non-reasoning LLMs of varying sizes from providers including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.
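To make the evaluation setup concrete, the following is a minimal sketch (not the authors' code) of an LLM judge for utterance-response coherence that combines chain-of-thought prompting with self-consistency: the judge is sampled several times at nonzero temperature and the majority verdict wins. The OpenAI Chat Completions API is assumed; the model name, prompt wording, and number of samples are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of LLM-based utterance-response coherence judging
# with chain-of-thought (CoT) prompting and self-consistency voting.
# Model name, prompt text, and k are assumptions for demonstration only.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You evaluate an in-car restaurant-recommendation assistant.
User utterance: {utterance}
System response: {response}
Think step by step: list the user's constraints (cuisine, price, distance,
opening hours, ...) and check each one against the response.
Finish with a single line 'VERDICT: COHERENT' or 'VERDICT: FAILURE'."""


def judge_once(utterance: str, response: str, model: str = "gpt-4o-mini") -> str:
    """One CoT judgment; returns 'COHERENT' or 'FAILURE'."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0.7,  # sampling diversity is required for self-consistency
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(utterance=utterance, response=response),
        }],
    )
    text = completion.choices[0].message.content or ""
    return "COHERENT" if "VERDICT: COHERENT" in text else "FAILURE"


def judge_self_consistent(utterance: str, response: str, k: int = 5) -> str:
    """Self-consistency: sample k independent reasoning paths, majority vote."""
    votes = Counter(judge_once(utterance, response) for _ in range(k))
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(judge_self_consistent(
        "Find me a cheap vegan place that's open now, within 10 minutes.",
        "I found Green Garden, a vegan restaurant 8 minutes away, "
        "open until 10 pm, rated as inexpensive.",
    ))
```

Comparing such majority verdicts against the known labels of the correct and failure-injected synthetic responses yields the precision, recall, and F1 figures reported above.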