Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity, these real-world distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts which naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training results in substantially lower out-of-distribution than in-distribution performance, and that this gap remains even with models trained by existing methods for handling distribution shifts. This underscores the need for new training methods that produce models which are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
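To illustrate the open-source package, the following is a minimal sketch of loading a dataset and iterating over its training split, following the interface documented in the WILDS repository; the dataset identifier, transform, and batch size here are illustrative and may differ across package versions.

```python
# Sketch of the WILDS package interface; names follow the repository's
# documented API, but details may vary by version.
import torchvision.transforms as transforms
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

# Download (if needed) and load Camelyon17, where the distribution
# shift is across hospitals for tumor identification.
dataset = get_dataset(dataset="camelyon17", download=True)

# Get the in-distribution training split with simple preprocessing.
train_data = dataset.get_subset(
    "train",
    transform=transforms.Compose([transforms.Resize((96, 96)),
                                  transforms.ToTensor()]),
)

# Standard (i.i.d.) data loader over the training split.
train_loader = get_train_loader("standard", train_data, batch_size=32)

for x, y_true, metadata in train_loader:
    # metadata encodes each example's domain (e.g., hospital),
    # which methods for handling distribution shifts can exploit.
    ...  # training step goes here
```

The package similarly provides evaluation loaders and a standardized `dataset.eval(...)` routine for scoring predictions on the out-of-distribution test splits, so that results are comparable across methods.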