Merging satellite products with ground-based measurements is often required to obtain precipitation datasets that cover large regions at high spatial density while being more accurate than satellite-only precipitation products. Machine and statistical learning regression algorithms are regularly used for this purpose. Tree-based ensemble algorithms, meanwhile, are adopted in many fields to solve regression problems with high accuracy and low computational cost. Still, the literature offers no guidance on which tree-based ensemble algorithm to select for correcting satellite precipitation products over the contiguous United States (US) at the daily time scale. In this study, we worked towards filling this methodological gap by conducting an extensive comparison of three such algorithms: random forests, gradient boosting machines (gbm) and extreme gradient boosting (XGBoost). We used daily data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and IMERG (Integrated Multi-satellitE Retrievals for GPM) gridded datasets, together with earth-observed precipitation data from the Global Historical Climatology Network daily (GHCNd) database. The experiments covered the entire contiguous US and additionally included linear regression as a benchmark. The results suggest that XGBoost is the best-performing tree-based ensemble algorithm among those compared...
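The comparison described above, a tree-based ensemble against a linear regression benchmark, can be illustrated with a minimal sketch. It is not the study's pipeline. It assumes purely synthetic data, a single hypothetical satellite predictor with a saturating relationship to the gauge value, and a hand-rolled gradient boosting of one-split regression stumps standing in for gbm, XGBoost and random forests. All function and variable names are illustrative.

```python
import math
import random


def fit_linear(x, y):
    """Ordinary least squares with one predictor (the benchmark)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda v: intercept + slope * v


def fit_stump(x, r):
    """Best one-split regression stump on residuals r under squared error."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    rs = [r[i] for i in order]
    n, total = len(rs), sum(rs)
    best = None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += rs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # no valid threshold between equal predictor values
        # Maximising this score is equivalent to minimising the split SSE.
        score = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
        if best is None or score > best[0]:
            threshold = (xs[k - 1] + xs[k]) / 2
            best = (score, threshold, left_sum / k, (total - left_sum) / (n - k))
    if best is None:  # all predictor values identical: constant stump
        return xs[0], total / n, total / n
    return best[1], best[2], best[3]


def fit_boosting(x, y, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [b - p for b, p in zip(y, pred)]
        thr, left, right = fit_stump(x, resid)
        stumps.append((thr, left, right))
        pred = [p + learning_rate * (left if a <= thr else right)
                for a, p in zip(x, pred)]

    def predict(v):
        return base + learning_rate * sum(l if v <= t else r for t, l, r in stumps)

    return predict


def mse(model, x, y):
    return sum((model(a) - b) ** 2 for a, b in zip(x, y)) / len(y)


def make_data(rng, n):
    """Synthetic stand-in: satellite estimate x, gauge value y (both mm/day)."""
    x = [rng.uniform(0.0, 50.0) for _ in range(n)]
    y = [max(0.0, 20.0 * (1.0 - math.exp(-a / 10.0)) + rng.gauss(0.0, 1.0))
         for a in x]
    return x, y


rng = random.Random(0)
x_train, y_train = make_data(rng, 400)
x_test, y_test = make_data(rng, 200)

lin_mse = mse(fit_linear(x_train, y_train), x_test, y_test)
gbm_mse = mse(fit_boosting(x_train, y_train), x_test, y_test)
print(f"linear regression test MSE: {lin_mse:.3f}")
print(f"boosted stumps test MSE:    {gbm_mse:.3f}")
```

On this synthetic setup the boosted stumps can follow the nonlinear satellite-to-gauge relationship that a straight line cannot, which is the kind of gap a benchmark comparison is meant to expose. It says nothing about the relative ranking of random forests, gbm and XGBoost on real data.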