The conversational machine reading comprehension (CMRC) task aims to answer questions in conversations, and has become a popular research topic in recent years because of its wide range of applications. However, existing CMRC benchmarks, in which each conversation is assigned a single static passage, are inconsistent with real scenarios. As a result, a model's comprehension ability in real scenarios is hard to evaluate reasonably. To this end, we propose Orca, the first Chinese CMRC benchmark, and further provide zero-shot/few-shot settings to evaluate models' generalization ability across diverse domains. We collect 831 hot-topic-driven conversations with 4,742 turns in total. Each turn of a conversation is assigned a response-related passage, aiming to evaluate a model's comprehension ability more reasonably. The topics of the conversations are collected from social media platforms and cover 33 domains, in an effort to be consistent with real scenarios. Importantly, the answers in Orca are all well-annotated natural responses rather than the specific spans or short phrases used in previous datasets. In addition, we implement three strong baselines to tackle the challenges in Orca. The results indicate that our CMRC benchmark remains highly challenging. Our dataset and checkpoints are available at https://github.com/nuochenpku/Orca.