Designing dialog tutors is challenging because it requires modeling the diverse and complex pedagogical strategies employed by human tutors. Despite significant recent advances in neural conversational systems built on large language models (LLMs) and growth in available dialog corpora, dialog tutoring has largely remained unaffected by these advances. In this paper, we rigorously analyze various generative language models on two dialog tutoring datasets for language learning, using automatic and human evaluations, to understand both the new opportunities these advances bring and the challenges we must overcome to build models usable in real educational settings. We find that although current approaches can model tutoring in constrained learning scenarios, where the number of concepts to be taught and the set of possible teacher strategies are small, they perform poorly in less constrained scenarios. Our human quality evaluation shows that both model outputs and ground-truth annotations score low on equitable tutoring, which measures the learning opportunities offered to students and how engaging the dialog is. To understand the behavior of our models in a real tutoring setting, we conduct a user study with expert annotators and find a substantial number of model reasoning errors, which appear in 45% of conversations. Finally, we synthesize our findings to outline directions for future work.