Representation learning has proven to be a powerful methodology in a wide variety of machine learning applications. For atmospheric dynamics, however, it has so far not been considered, arguably due to the lack of large-scale, labeled datasets that could be used for training. In this work, we show that the difficulty is benign and introduce a self-supervised learning task that defines a categorical loss for a wide variety of unlabeled atmospheric datasets. Specifically, we train a neural network on the simple yet intricate task of predicting the temporal distance between atmospheric fields from distinct but nearby times. We demonstrate that training with this task on ERA5 reanalysis data leads to internal representations that capture intrinsic aspects of atmospheric dynamics, and we use these representations to introduce a data-driven distance metric for atmospheric states. When employed as a loss function in other machine learning applications, this AtmoDist distance leads to improved results compared to the classical $\ell_2$-loss. For example, for downscaling one obtains higher-resolution fields that match the true statistics more closely than previous approaches, and for the interpolation of missing or occluded data the AtmoDist distance yields results with more realistic fine-scale features. Since it is derived from observational data, AtmoDist also provides a novel perspective on atmospheric predictability.
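To make the pretext task concrete, the following is a minimal, hypothetical sketch of how the temporal-distance prediction described above could be set up: pairs of atmospheric fields are sampled a random number of time steps apart, and a shared (siamese) encoder is trained with a categorical loss to classify that temporal offset. All names (`Encoder`, `AtmoDistNet`, `era5_fields`, `MAX_LAG`) and architectural details are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of the temporal-distance pretext task (not the authors' code).
import torch
import torch.nn as nn

MAX_LAG = 8  # number of temporal-distance classes (hypothetical choice)
# Stand-in for ERA5 reanalysis patches with layout (time, channels, lat, lon).
era5_fields = torch.randn(1000, 2, 160, 160)

class Encoder(nn.Module):
    """Shared CNN that embeds a single atmospheric field."""
    def __init__(self, channels=2, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

class AtmoDistNet(nn.Module):
    """Siamese encoder plus a classification head over temporal distances."""
    def __init__(self, n_classes=MAX_LAG, dim=128):
        super().__init__()
        self.encoder = Encoder(dim=dim)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, x1, x2):
        z = torch.cat([self.encoder(x1), self.encoder(x2)], dim=-1)
        return self.head(z)

def sample_pair(batch_size=32):
    """Draw field pairs separated by a random lag; the lag index is the label."""
    t = torch.randint(0, era5_fields.shape[0] - MAX_LAG, (batch_size,))
    lag = torch.randint(1, MAX_LAG + 1, (batch_size,))
    return era5_fields[t], era5_fields[t + lag], lag - 1  # classes 0..MAX_LAG-1

model = AtmoDistNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # the categorical loss over temporal distances

for step in range(100):
    x1, x2, y = sample_pair()
    loss = loss_fn(model(x1, x2), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After such pretraining, the internal representations of the encoder (or a distance computed on them) could serve as the AtmoDist metric between atmospheric states; how that metric is extracted here is again only an assumption for illustration.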