Everyday conversations require understanding everyday events, which, in turn, requires understanding the temporal commonsense concepts interwoven with those events. Despite recent progress with massive pre-trained language models (LMs) such as T5 and GPT-3, their capability for temporal reasoning in dialogs remains largely under-explored. In this paper, we present the first study to investigate pre-trained LMs for their temporal reasoning capabilities in dialogs by introducing a new task and a crowd-sourced English challenge set, TIMEDIAL. We formulate TIMEDIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs. Empirical results demonstrate that even the best-performing models struggle on this task, lagging behind humans by 23 absolute points in accuracy. Furthermore, our analysis reveals that the models fail to reason correctly about dialog context; instead, they rely on shallow cues based on existing temporal patterns in context, motivating future research on modeling temporal concepts in text and robust contextual reasoning about them. The dataset is publicly available at: https://github.com/google-research-datasets/timedial.
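To make the multiple-choice cloze formulation concrete, below is a minimal sketch of how one could score candidate temporal spans with an off-the-shelf T5 checkpoint via Hugging Face Transformers. The dialog, the candidate fillers, and the span-loss scoring scheme are illustrative assumptions, not the paper's exact data or evaluation protocol.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.eval()

# A hypothetical TIMEDIAL-style instance: a dialog with one masked
# temporal span and four candidate fillers (all invented for illustration).
dialog = (
    "A: How long have you lived in this apartment? "
    "B: I moved in <extra_id_0> ago, right after college."
)
candidates = ["three years", "three seconds", "two decades", "ten minutes"]

def span_loss(context: str, filler: str) -> float:
    """Average token loss of generating `filler` for the masked span.

    Uses T5's span-corruption format: the target is the sentinel token
    followed by the filler text. Lower loss = more plausible to the LM.
    """
    inputs = tokenizer(context, return_tensors="pt")
    labels = tokenizer(
        f"<extra_id_0> {filler} <extra_id_1>", return_tensors="pt"
    ).input_ids
    with torch.no_grad():
        return model(**inputs, labels=labels).loss.item()

# Rank candidates by plausibility under the pre-trained LM.
scored = sorted((span_loss(dialog, c), c) for c in candidates)
for loss, filler in scored:
    print(f"{loss:.3f}  {filler}")
```

Note that such zero-shot span scoring tends to reward surface-level temporal patterns (e.g. "years" co-occurring with "college"), which is consistent with the shallow-cue behavior the analysis above describes.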