Research on multiparty dialog has grown considerably in recent years. We present Molweni, a machine reading comprehension (MRC) dataset with discourse structure, built over multiparty dialog. Molweni is sampled from the Ubuntu Chat Corpus and includes 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations, in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style, for all of its multiparty dialogs, bringing large-scale data (78,245 annotated discourse relations) to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni's questions, a significant drop of more than 20% compared with its SQuAD 2.0 performance.