Distilling the supervision signal from a long sequence to make predictions is a challenging task in machine learning, especially when not all elements of the input sequence contribute equally to the desired output. In this paper, we propose SpanDrop, a simple and effective data augmentation technique that helps models identify the true supervision signal in a long sequence from very few examples. By directly manipulating the input sequence, SpanDrop randomly ablates parts of the sequence at a time and asks the model to perform the same task, emulating counterfactual learning and achieving input attribution. Based on a theoretical analysis of its properties, we also propose a variant of SpanDrop based on the beta-Bernoulli distribution, which yields diverse augmented sequences while providing a learning objective that is more consistent with the original dataset. We demonstrate the effectiveness of SpanDrop on a set of carefully designed toy tasks, as well as on various natural language processing tasks that require reasoning over long sequences to arrive at the correct answer, and show that it improves model performance both when data is scarce and when it is abundant.
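The core augmentation described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: `span_drop` drops each span independently with a fixed probability (Bernoulli), while `beta_span_drop` first samples a per-example drop rate from a Beta distribution with the same mean, which produces more diverse augmented sequences; the function names, the `beta_sum` concentration parameter, and the keep-at-least-one-span fallback are illustrative assumptions.

```python
import random


def span_drop(spans, p=0.1, rng=random):
    """Bernoulli SpanDrop sketch: drop each span independently with
    probability p. `spans` is any list of sequence segments (e.g.
    sentences or token chunks); this interface is illustrative only."""
    kept = [s for s in spans if rng.random() >= p]
    # Assumption: keep at least one span so the input is never empty.
    return kept if kept else [rng.choice(spans)]


def beta_span_drop(spans, p=0.1, beta_sum=10.0, rng=random):
    """Beta-Bernoulli SpanDrop sketch: sample a per-example drop rate
    from Beta(alpha, beta) with mean p, then drop each span with that
    rate. `beta_sum` controls the concentration (alpha + beta) and is
    an assumed knob, not a parameter from the paper."""
    alpha = p * beta_sum
    beta = (1.0 - p) * beta_sum
    q = rng.betavariate(alpha, beta)
    kept = [s for s in spans if rng.random() >= q]
    return kept if kept else [rng.choice(spans)]
```

In training, each epoch would apply one of these functions to the segmented input before rejoining the surviving spans, so the model repeatedly sees counterfactual versions of the same example with different spans removed.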