Distribution shift occurs when the test distribution differs from the training distribution, and it can considerably degrade performance of machine learning models deployed in the real world. Temporal shifts -- distribution shifts arising from the passage of time -- often occur gradually and have the additional structure of timestamp metadata. By leveraging timestamp metadata, models can potentially learn from trends in past distribution shifts and extrapolate into the future. While recent works have studied distribution shifts, temporal shifts remain underexplored. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning. We use two evaluation strategies: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol, while Eval-Stream is more realistic for certain real-world applications. Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data. Existing methods are unable to close this gap. Code is available at https://wild-time.github.io/.
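To make the two evaluation strategies concrete, the following is a minimal sketch of how Eval-Fix and Eval-Stream might be computed on timestamped data. It is not the Wild-Time implementation. The function names (`eval_fix`, `eval_stream`, `train_majority`), the `horizon` parameter, the choice to retrain on all data seen so far at each stream step, and the toy majority-label model are all assumptions made for illustration; the abstract states only that Eval-Fix uses a fixed time split and Eval-Stream uses a data stream.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]              # (feature, label); toy format, an assumption
Model = Callable[[float], int]
Trainer = Callable[[Sequence[Example]], Model]


def train_majority(data: Sequence[Example]) -> Model:
    """Hypothetical stand-in learner: always predicts the most common training label."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda x: majority


def accuracy(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in data)


def eval_fix(by_time: Dict[int, List[Example]], train_fn: Trainer, split_time: int) -> float:
    """Fixed time split: train on timestamps <= split_time, test on later ones."""
    times = sorted(by_time)
    train = [ex for t in times if t <= split_time for ex in by_time[t]]
    test = [ex for t in times if t > split_time for ex in by_time[t]]
    return accuracy(train_fn(train), test)


def eval_stream(by_time: Dict[int, List[Example]], train_fn: Trainer, horizon: int = 1) -> float:
    """Data stream: at each timestamp, train on everything seen so far (an assumption)
    and test on the next `horizon` timestamps; return the average over the stream."""
    times = sorted(by_time)
    scores = []
    for i in range(len(times) - 1):
        seen = [ex for t in times[: i + 1] for ex in by_time[t]]
        future = [ex for t in times[i + 1 : i + 1 + horizon] for ex in by_time[t]]
        scores.append(accuracy(train_fn(seen), future))
    return mean(scores)
```

In this sketch, `eval_fix` gives one out-of-distribution score for one split point, and `eval_stream` averages the scores obtained as the split point moves forward through the timestamps.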