Machine learning models that aggregate the outputs of submodels, either at the activation or the prediction level, achieve strong performance. We study the interplay between two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the improvements of our approach over several challenging baselines in accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.
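To make the combination concrete, the sketch below illustrates one plausible reading of the idea: the experts of a single sparse MoE layer are split into disjoint partitions, one per ensemble member, each member routes tokens only within its partition, and the member outputs are averaged. All class and parameter names (e.g. `PartitionedSparseMoE`, `n_members`, `top_k`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class PartitionedSparseMoE:
    """Toy sparse-MoE layer whose experts are split into disjoint partitions,
    one partition per ensemble member (names and routing are assumptions)."""

    def __init__(self, d_model, d_hidden, n_experts, n_members, top_k=1, seed=0):
        assert n_experts % n_members == 0, "experts must divide evenly across members"
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.n_members = n_members
        self.per_member = n_experts // n_members
        # One router per ensemble member, scoring only that member's experts.
        self.router = rng.normal(0, 0.02, (n_members, d_model, self.per_member))
        # Expert MLP weights: project to hidden width and back.
        self.w_in = rng.normal(0, 0.02, (n_experts, d_model, d_hidden))
        self.w_out = rng.normal(0, 0.02, (n_experts, d_hidden, d_model))

    def member_forward(self, x, m):
        """Route each token to its top-k experts inside member m's partition."""
        scores = softmax(x @ self.router[m])            # (tokens, per_member)
        top = np.argsort(-scores, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for local_e in top[t]:
                e = m * self.per_member + local_e       # global expert index
                h = np.maximum(x[t] @ self.w_in[e], 0)  # ReLU expert MLP
                out[t] += scores[t, local_e] * (h @ self.w_out[e])
        return out

    def forward(self, x):
        """Average the member outputs, i.e. ensemble at the activation level."""
        return np.mean([self.member_forward(x, m) for m in range(self.n_members)], axis=0)

# Tiny usage example: 4 tokens, 2 members, 4 experts split 2-and-2.
layer = PartitionedSparseMoE(d_model=8, d_hidden=16, n_experts=4, n_members=2)
tokens = np.random.default_rng(1).normal(size=(4, 8))
print(layer.forward(tokens).shape)  # (4, 8)
```

The sketch only conveys the partitioning-plus-averaging structure; an actual batch-ensemble-style implementation would vectorize the routing and share parameters far more aggressively than this loop-based toy.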