Averaging the predictions of a deep ensemble of networks is a popular and effective method to improve predictive performance and calibration on various benchmarks and in Kaggle competitions. However, the runtime and training cost of deep ensembles grow linearly with the size of the ensemble, making them unsuitable for many applications. Averaging ensemble weights instead of predictions circumvents this disadvantage during inference and is typically applied to intermediate checkpoints of a model to reduce training cost. Albeit effective, only a few works have improved the understanding and the performance of weight averaging. Here, we revisit this approach and show that a simple weight fusion (WF) strategy can lead to significantly improved predictive performance and calibration. We describe the prerequisites the weights must meet in terms of weight space, functional space, and loss. Furthermore, we present a new test method (called the oracle test) to measure the distance between weights in functional space. We demonstrate the versatility of our WF strategy across state-of-the-art segmentation CNNs and Transformers as well as real-world datasets such as BDD100K and Cityscapes. We compare WF with similar approaches and show its superiority for in- and out-of-distribution data in terms of predictive performance and calibration.
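To illustrate the general idea of averaging weights instead of predictions, the following minimal sketch fuses several checkpoints of the same architecture into a single set of weights, so inference costs the same as a single model. This is a generic illustration under the assumption of PyTorch state dicts, not the paper's exact WF procedure; average_state_dicts and the checkpoint paths are hypothetical.

import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of floating-point parameters; non-float buffers are copied."""
    fused = {}
    for key, value in state_dicts[0].items():
        if value.is_floating_point():
            fused[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            fused[key] = value.clone()  # e.g. BatchNorm's num_batches_tracked counter
    return fused

# Example usage (hypothetical checkpoint files of identically structured models):
# ckpts = [torch.load(p, map_location="cpu") for p in ["ckpt_a.pt", "ckpt_b.pt"]]
# model.load_state_dict(average_state_dicts(ckpts))  # single forward pass at inference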