In attempts to develop sample-efficient algorithms, researchers have explored myriad mechanisms for collecting and exploiting feature feedback: auxiliary annotations provided for training (but not test) instances that highlight salient evidence. Examples include bounding boxes around objects and salient spans in text. Despite its intuitive appeal, feature feedback has not delivered significant gains in practical problems as assessed on i.i.d. holdout sets. However, recent works on counterfactually augmented data suggest an alternative benefit of supplemental annotations: lessening sensitivity to spurious patterns and consequently delivering gains in out-of-domain evaluations. Inspired by these findings, we hypothesize that while the numerous existing methods for incorporating feature feedback have delivered negligible in-sample gains, they may nevertheless generalize better out-of-domain. In experiments addressing sentiment analysis, we show that feature feedback methods perform significantly better on various natural out-of-domain datasets even absent differences on in-domain evaluations. By contrast, on natural language inference tasks, performance remains comparable. Finally, we contrast the tasks where feature feedback does (and does not) help.