Selecting an effective training signal for tasks in natural language processing is difficult: collecting expert annotations is expensive, and crowd-sourced annotations may not be reliable. At the same time, recent work in machine learning has demonstrated that learning from soft labels acquired from crowd annotations can be effective, especially when there is distribution shift in the test set. However, the best method for acquiring these soft labels is inconsistent across tasks. This paper proposes new methods for acquiring soft labels from crowd annotations by aggregating the distributions produced by existing methods. In particular, we propose to find a distribution over classes by learning from multiple views of crowd annotations via temperature scaling and by finding the Jensen-Shannon centroid of their distributions. We demonstrate that these aggregation methods lead to best or near-best performance across four NLP tasks on out-of-domain test sets, mitigating the fluctuations in performance that arise when the constituent methods are used on their own. Additionally, these methods yield best or near-best uncertainty estimation across tasks. We argue that aggregating different views of crowd annotations as soft labels is an effective way to ensure performance that is as good as or better than that of the best individual view, which is useful given the inconsistent performance of the individual methods.
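The abstract does not spell out the aggregation algorithms, but the two ideas named above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the use of `scipy.optimize.minimize` with a softmax parameterization for the Jensen-Shannon centroid, and the temperature value are all assumptions introduced here for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_centroid(dists):
    """Distribution minimizing the sum of JS divergences to the inputs.

    The centroid is parameterized by unconstrained logits and mapped to
    the probability simplex with a softmax; optimization starts from the
    arithmetic mean of the input distributions.
    """
    init = np.mean(dists, axis=0)
    obj = lambda z: sum(js_divergence(p, softmax(z)) for p in dists)
    res = minimize(obj, np.log(init + 1e-12), method="L-BFGS-B")
    return softmax(res.x)

def temperature_aggregate(dists, temperature=2.0, eps=1e-12):
    """Average temperature-softened versions of each distribution.

    A temperature > 1 flattens each view's distribution before averaging,
    so no single aggregation method dominates the combined soft label.
    """
    softened = [softmax(np.log(p + eps) / temperature) for p in dists]
    return np.mean(softened, axis=0)
```

For example, given three soft-label distributions over the same classes (each produced by a different crowd-aggregation method), either function returns a single valid distribution to use as the training target.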