We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference. We show that models can capture the distribution of human judgements by applying additional distribution estimation methods, namely Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation. All four methods substantially outperform the softmax baseline. MC Dropout achieves decent performance without any distribution annotations, while Re-Calibration gives further substantial improvements when extra distribution annotations are provided, suggesting that multiple annotations per example are valuable for modeling the distribution of human judgements. Moreover, MC Dropout and Re-Calibration achieve decent transfer performance on out-of-domain data. Despite these improvements, the best results remain far below the estimated human upper bound, indicating that predicting the distribution of human judgements is still an open, challenging problem with ample room for improvement. We showcase the common errors of MC Dropout and Re-Calibration. Finally, we give guidelines on using these methods under different levels of data availability and encourage future work on modeling the distribution of human opinions in language reasoning.
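To make the MC Dropout idea concrete, the sketch below shows one common formulation: keep dropout active at inference time, run several stochastic forward passes, and average the resulting softmax outputs into a single distribution over the three NLI labels (entailment, neutral, contradiction). Everything here is an illustrative assumption rather than the authors' implementation: the two-layer NumPy classifier, the function names `forward` and `mc_dropout_distribution`, and the parameters `n_samples` and `p_drop` are all hypothetical.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x, W1, W2, rng, p_drop):
    # Hypothetical two-layer classifier standing in for a real NLI model.
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    if p_drop > 0.0:
        # Inverted dropout: zero units with probability p_drop, rescale the rest.
        mask = rng.random(h.shape) >= p_drop
        h = h * mask / (1.0 - p_drop)
    return softmax(h @ W2)


def mc_dropout_distribution(x, W1, W2, n_samples=100, p_drop=0.1, seed=0):
    # Average the softmax outputs of n_samples stochastic forward passes.
    # (Assumed formulation; counting argmax votes is another possible variant.)
    rng = np.random.default_rng(seed)
    samples = np.stack(
        [forward(x, W1, W2, rng, p_drop) for _ in range(n_samples)]
    )
    return samples.mean(axis=0)


# Toy usage: 4 examples, 8 input features, 3 NLI labels, random weights.
demo_rng = np.random.default_rng(42)
W1 = demo_rng.normal(size=(8, 16))
W2 = demo_rng.normal(size=(16, 3))
x = demo_rng.normal(size=(4, 8))
dist = mc_dropout_distribution(x, W1, W2)  # one label distribution per example
```

Each row of `dist` is a probability distribution over the three labels and can be compared against the empirical distribution of annotator labels. Note that this procedure needs no distribution annotations, which matches the setting in which the abstract reports MC Dropout to be useful.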