Training NLP systems typically assumes access to annotated data with a single human label per example. Given imperfect labeling by annotators and the inherent ambiguity of language, we hypothesize that a single label is not sufficient to capture the spectrum of language interpretation. We explore new annotation distribution schemes that assign multiple labels per example to a small subset of training examples. Introducing such multi-label examples at the cost of annotating fewer examples brings clear gains on natural language inference and entity typing tasks, even when we simply first train with single-label data and then fine-tune with multi-label examples. Extending the MixUp data augmentation framework, we propose a learning algorithm that can learn from training examples with different amounts of annotation (zero, one, or multiple labels). This algorithm efficiently combines signals from uneven training data and brings additional gains in low annotation-budget and cross-domain settings. Together, our methods achieve consistent gains on both tasks, suggesting that distributing labels unevenly among training examples can be beneficial for many NLP tasks.
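For intuition on the MixUp-style combination of examples carrying different amounts of annotation, below is a minimal sketch. The function names, shapes, and the use of model-predicted distributions for unlabeled examples are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def mixup_example(x1, y1, x2, y2, alpha=0.4):
    """Interpolate two training examples and their label distributions.

    x1, x2: input representations (e.g., sentence embeddings), shape (d,)
    y1, y2: label distributions over classes, shape (num_classes,);
            a single-label example is a one-hot vector, a multi-label
            example is the empirical distribution over annotator labels,
            and a zero-label example could use a model-predicted distribution.
    """
    lam = np.random.beta(alpha, alpha)     # mixing coefficient from Beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2    # interpolate inputs
    y_mix = lam * y1 + (1.0 - lam) * y2    # interpolate label targets
    return x_mix, y_mix

# Example: mix a single-label example with a multi-label one (3 classes).
x_a, y_a = np.random.randn(8), np.array([1.0, 0.0, 0.0])        # one-hot label
x_b, y_b = np.random.randn(8), np.array([0.2, 0.6, 0.2])        # annotator distribution
x_mix, y_mix = mixup_example(x_a, y_a, x_b, y_b)
```

The mixed target y_mix is a soft distribution, so training would typically minimize a cross-entropy or KL-divergence loss against the model's predicted distribution rather than a hard-label loss.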