Gradient-based learning algorithms have an implicit simplicity bias which can in effect limit the diversity of predictors sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) leveraging only a small subset of predictive features. Such an effect is especially magnified when the test distribution does not exactly match the train distribution -- referred to as the Out-of-Distribution (OOD) generalization problem. However, given only the training data, it is not always possible to assess a priori whether a given feature is spurious or transferable. Instead, we advocate for learning an ensemble of models which capture a diverse set of predictive features. Towards this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training), which enforces agreement among the models on the training data, but disagreement on the OOD data. We show how D-BAT naturally emerges from the notion of generalized discrepancy, and demonstrate in multiple experiments how the proposed method can mitigate shortcut learning, enhance uncertainty estimation and OOD detection, as well as improve transferability.
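The following is a minimal sketch of the agreement/disagreement objective described above, assuming a two-model, binary-classification setup in PyTorch. The function name `dbat_loss`, the weighting coefficient `alpha`, and the specific form of the disagreement term are illustrative assumptions for this sketch, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dbat_loss(h2, h1, x_train, y_train, x_ood, alpha=1.0):
    """Train a second model h2 to agree with the labels on in-distribution
    data while disagreeing with a frozen first model h1 on OOD inputs."""
    # Supervised term: agreement with the ground-truth labels on training data.
    ce = F.cross_entropy(h2(x_train), y_train)

    # Disagreement term on OOD data: encourage h1 and h2 to place their
    # probability mass on different classes (binary case shown here).
    with torch.no_grad():
        p1 = F.softmax(h1(x_ood), dim=1)  # predictions of the frozen model
    p2 = F.softmax(h2(x_ood), dim=1)

    # Probability that the two models predict different classes on each input.
    p_disagree = p1[:, 0] * p2[:, 1] + p1[:, 1] * p2[:, 0]
    disagreement = -torch.log(p_disagree + 1e-7).mean()

    return ce + alpha * disagreement
```

Under this reading, an ensemble would be built sequentially: each new model is fit with the supervised term while being pushed, via the disagreement term, away from the predictions of the previously trained models on the OOD data.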