We present Molweni, a machine reading comprehension (MRC) dataset built over multiparty dialogues. Molweni is sampled from the Ubuntu Chat Corpus and comprises 10,000 dialogues with 88,303 utterances. We annotate 32,700 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations for its multiparty dialogues, bringing large-scale data (78,246 annotated discourse relations) to bear on the task of multiparty dialogue understanding. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni's questions, a significant drop of more than 20% compared with its SQuAD 2.0 performance.