Most crowdsourcing learning methods treat disagreement between annotators as noisy labels, whereas inter-annotator disagreement among experts is often a good indicator of the ambiguity and uncertainty inherent in natural language. In this paper, we propose a framework called Learning Ambiguity from Crowd Sequential Annotations (LA-SCA) to explore the inter-disagreement between reliable annotators and effectively preserve confusing label information. First, a hierarchical Bayesian model is developed to infer the ground truth from crowds and to group annotators with similar reliability together. By modeling the relationship between the size of the group an annotator belongs to, the annotator's reliability, and each element's unambiguity in a sequence, the inter-disagreement between reliable annotators on ambiguous elements is computed to obtain label confusion information, which is incorporated into cost-sensitive sequence labeling. Experimental results on POS tagging and NER tasks show that our proposed framework achieves competitive performance in inferring ground truth from crowds and in predicting unknown sequences, and that interpreting the hierarchical clustering results helps discover the labeling patterns of annotators with similar reliability.
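To make the ground-truth inference step concrete, the following is a minimal sketch of reliability-weighted aggregation of crowd sequential annotations. It uses simple iterative weighted voting with agreement-based reliability updates; this is an illustrative simplification, not the hierarchical Bayesian model proposed in the paper, and all function and variable names here are hypothetical.

```python
from collections import Counter

def aggregate_crowd_sequences(annotations, n_iters=10):
    """Toy aggregation of crowd sequence labels (illustrative only).

    annotations: dict mapping annotator id -> list of label sequences,
    aligned across annotators (same sequences, same token positions).
    Returns (inferred label sequences, per-annotator reliability).
    """
    annotators = list(annotations)
    reliability = {a: 1.0 for a in annotators}  # start with uniform trust
    n_seqs = len(next(iter(annotations.values())))
    inferred = []
    for _ in range(n_iters):
        # Infer labels: reliability-weighted vote at each token position.
        inferred = []
        for s in range(n_seqs):
            seq_len = len(annotations[annotators[0]][s])
            seq = []
            for t in range(seq_len):
                votes = Counter()
                for a in annotators:
                    votes[annotations[a][s][t]] += reliability[a]
                seq.append(votes.most_common(1)[0][0])
            inferred.append(seq)
        # Update reliability: token-level agreement with inferred labels.
        for a in annotators:
            total = correct = 0
            for s in range(n_seqs):
                for t, lab in enumerate(inferred[s]):
                    total += 1
                    correct += (annotations[a][s][t] == lab)
            reliability[a] = correct / total
    return inferred, reliability
```

Tokens where reliable (high-weight) annotators still split their votes would, in the paper's terms, be candidates for ambiguous elements whose confusion information is worth preserving rather than discarding as noise.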