Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are richer than those obtained with a single optimization episode. We support this thesis with simple theoretical arguments and with experiments utilizing an apparently na\"{\i}ve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained with a single training run. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
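The ensembling technique described above can be made concrete with a minimal sketch. The architecture, synthetic data, and optimizer settings below are illustrative assumptions rather than the paper's experimental setup: several identical networks are trained on the same data with only the random seed varying, and their penultimate-layer features are concatenated to form the richer representation.

\begin{verbatim}
# Minimal sketch (assumed toy MLP and synthetic data, not the paper's setup):
# train K identical networks that differ only in their random seed, then use
# the concatenation of their feature layers as the combined representation.

import torch
import torch.nn as nn

def make_net(seed, in_dim=20, hidden=64, feat_dim=16, n_classes=2):
    torch.manual_seed(seed)              # only the seed differs across episodes
    feature = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, feat_dim), nn.ReLU())
    head = nn.Linear(feat_dim, n_classes)
    return feature, head

def train_episode(seed, x, y, epochs=200, lr=1e-2):
    # same data, model, algorithm, and hyper-parameters for every episode
    feature, head = make_net(seed)
    opt = torch.optim.SGD(list(feature.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feature(x)), y)
        loss.backward()
        opt.step()
    return feature

# toy training distribution standing in for the real training set
torch.manual_seed(0)
x = torch.randn(512, 20)
y = (x[:, 0] + 0.5 * x[:, 1] > 0).long()

# K independent training episodes, identical except for the random seed
features = [train_episode(seed, x, y) for seed in range(4)]

# concatenated representation, e.g. as input to a linear probe on a new task
with torch.no_grad():
    concat_repr = torch.cat([f(x) for f in features], dim=1)  # shape (512, 4 * 16)
print(concat_repr.shape)
\end{verbatim}

In a transfer scenario, a linear probe fit on \texttt{concat\_repr} would be compared against one fit on the representation of a single, equivalently sized network; the names and dimensions above are placeholders for whatever architecture and data are actually used.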