When fitting statistical models to variables in geoscientific disciplines such as hydrology, it is customary to regionalize, that is, to divide a large spatial domain into multiple regions and study each region separately, instead of fitting a single model to the entire dataset (also known as unification). Traditional wisdom in these fields holds that models built separately for each region will perform better because of the homogeneity within each region. However, partitioning the training data gives each model access to fewer data points and prevents it from learning from commonalities between regions. Here, through two hydrologic examples (soil moisture and streamflow), we argue that unification can often significantly outperform regionalization in the era of big data and deep learning (DL). Common DL architectures, even without bespoke customization, can automatically build models that benefit from regional commonality while accurately learning region-specific differences. We highlight an effect we call data synergy, where the results of the DL models improved when data were pooled from characteristically different regions. In fact, the DL models benefited more from diverse training data than from homogeneous training data. We hypothesize that DL models automatically adjust their internal representations to identify commonalities while still retaining sufficient discriminatory information. These results argue for pooling data into larger datasets and suggest that the academic community should place greater emphasis on data sharing and compilation.
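The contrast between the two training schemes can be made concrete with a minimal sketch. It assumes a generic feature matrix, a target vector, and an integer region label per sample, and it uses an ordinary least-squares learner purely as a stand-in for the DL models discussed here. All function names (`fit_linear`, `predict_linear`, `rmse`, `compare_schemes`) are hypothetical and not taken from this work. A linear stand-in is not expected to reproduce the data synergy effect; the sketch only shows how the regionalized and unified experiments differ in what data each model sees.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for any learner)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply a model returned by fit_linear to new samples."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def rmse(y_true, y_pred):
    """Root-mean-square error between observations and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_schemes(X_train, y_train, r_train, X_test, y_test, r_test):
    """Return (regional_rmse, unified_rmse) evaluated on the same test set.

    Assumes every region label in r_test also appears in r_train.
    """
    # Regionalization: one model per region, trained only on that region.
    regional_pred = np.full(len(y_test), np.nan)
    for region in np.unique(r_train):
        train_mask = r_train == region
        coef = fit_linear(X_train[train_mask], y_train[train_mask])
        test_mask = r_test == region
        regional_pred[test_mask] = predict_linear(coef, X_test[test_mask])

    # Unification: a single model trained on the pooled data of all regions.
    unified_coef = fit_linear(X_train, y_train)
    unified_pred = predict_linear(unified_coef, X_test)

    return rmse(y_test, regional_pred), rmse(y_test, unified_pred)
```

Swapping `fit_linear` and `predict_linear` for a DL model's training and inference routines leaves the comparison logic unchanged, which is the point of keeping the learner behind two small functions.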