Machine learning (ML) has shown significant promise for studying complex geophysical dynamical systems, including turbulence and climate processes. Such systems often display sensitive dependence on initial conditions, reflected in positive Lyapunov exponents: even small perturbations in short-term forecasts can lead to large deviations in long-term outcomes. Meaningful inference therefore requires not only accurate short-term predictions but also consistency with the system's long-term attractor, as captured by the marginal distribution of state variables. Existing approaches attempt to address this challenge by incorporating spatial and temporal dependence, but these strategies become impractical when data are extremely sparse. In this work, we show that prior knowledge of marginal distributions offers valuable complementary information to short-term observations, motivating a distribution-informed learning framework. We introduce a calibration algorithm based on normalization and the Kernelized Stein Discrepancy (KSD) to enhance ML predictions. The method employs KSD within a reproducing kernel Hilbert space to calibrate model outputs, improving their fidelity to known physical distributions. This not only sharpens pointwise predictions but also enforces consistency with non-local statistical structure rooted in physical principles. Through synthetic experiments spanning offline climatological CO2 fluxes and online quasi-geostrophic flow simulations, we demonstrate the robustness and broad utility of the proposed framework.
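To make the KSD ingredient concrete, the sketch below estimates the squared KSD of a 1-D sample against a target distribution with known score function s_p(x) = d/dx log p(x), using the standard Stein kernel built from an RBF base kernel. This is only an illustration of the discrepancy itself, not the paper's calibration algorithm; the function name `ksd_vstat`, the bandwidth `h`, and the standard-normal target are assumptions for the example.

```python
# Minimal V-statistic estimate of KSD^2 for 1-D samples (illustrative sketch).
# Assumes an RBF kernel k(x, y) = exp(-(x - y)^2 / (2 h^2)) and a target whose
# score function s_p is known in closed form (here: a standard normal, s_p(x) = -x).
import numpy as np

def ksd_vstat(x, score, h=1.0):
    """V-statistic estimate of KSD^2 for 1-D samples x against score s_p."""
    x = np.asarray(x, dtype=float)
    s = score(x)                           # s_p(x_i) at each sample
    d = x[:, None] - x[None, :]            # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * h**2))         # RBF kernel matrix k(x_i, x_j)
    dkx = -d / h**2 * k                    # d k / d x_i
    dky = d / h**2 * k                     # d k / d x_j
    dkxy = (1 / h**2 - d**2 / h**4) * k    # d^2 k / (d x_i d x_j)
    # Stein kernel: u_p(x, y) = s(x)s(y)k + s(x) dk/dy + s(y) dk/dx + d^2k/dxdy
    u = s[:, None] * s[None, :] * k + s[:, None] * dky + s[None, :] * dkx + dkxy
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                            # score of a standard normal target
good = ksd_vstat(rng.normal(0, 1, 500), score)  # samples matching the target
bad = ksd_vstat(rng.normal(2, 1, 500), score)   # mean-shifted samples
```

Samples drawn from the target yield a KSD^2 estimate near zero, while mean-shifted samples produce a clearly larger value, which is the property a KSD-based calibration exploits to pull model outputs toward a known physical distribution.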