Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. On these datasets, we systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.
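The self-training family benchmarked above can be illustrated with a minimal pseudo-labeling sketch: fit a model on the labeled source data, label the unlabeled data with its own predictions, then retrain on the union. The toy 2-D data and nearest-centroid "model" below are assumptions for illustration only, not the paper's experimental setup (real self-training methods typically also apply confidence thresholds and strong augmentation).

```python
# Minimal pseudo-labeling (self-training) sketch on toy 2-D data.
# The data and nearest-centroid classifier are illustrative assumptions.

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def predict(centroids, p):
    """Assign p to the label of the nearest centroid (squared distance)."""
    return min(centroids, key=lambda c: (centroids[c][0] - p[0]) ** 2
                                        + (centroids[c][1] - p[1]) ** 2)

# Labeled source data: two well-separated clusters.
labeled = {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(10.0, 10.0), (11.0, 10.0)]}
# Unlabeled data, e.g. drawn from the target distribution.
unlabeled = [(0.5, 1.0), (10.5, 9.0), (9.0, 11.0)]

# Step 1: fit on labeled data only.
centroids = {y: centroid(pts) for y, pts in labeled.items()}

# Step 2: pseudo-label the unlabeled points with the current model.
pseudo = [(p, predict(centroids, p)) for p in unlabeled]

# Step 3: retrain on labeled + pseudo-labeled data.
for p, y in pseudo:
    labeled[y].append(p)
centroids = {y: centroid(pts) for y, pts in labeled.items()}

print(predict(centroids, (0.2, 0.3)))  # prints 0
```

In practice the pseudo-label/retrain cycle is iterated, and low-confidence pseudo-labels are discarded so that early model errors are not amplified.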