Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently na\"ive ensembling technique: concatenating the representations obtained from multiple training episodes that use the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by the individual training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation, because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
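To make the ensembling technique concrete, the following is a minimal sketch (not the authors' code) of the procedure described above: several networks are trained independently on the same data with different random seeds, their representations are frozen and concatenated, and a fresh linear head is fit on the concatenated features for a new task or distribution. The architecture, dataset, and training loop are illustrative assumptions.
\begin{verbatim}
# Minimal sketch of the concatenation ensemble (illustrative assumptions only).
import torch
import torch.nn as nn

def make_encoder(seed: int, in_dim: int = 32, rep_dim: int = 64) -> nn.Module:
    """Build one small encoder; the seed is the only difference between episodes."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, rep_dim))

def train_episode(encoder, head, x, y, steps=200, lr=1e-2):
    """One ordinary training episode: encoder plus task head on the training data."""
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(encoder(x)), y)
        loss.backward()
        opt.step()

# Toy training data standing in for the shared dataset used by every episode.
x_train, y_train = torch.randn(512, 32), torch.randint(0, 4, (512,))

# Several independent episodes: same data, model, algorithm; different seeds.
encoders = []
for seed in range(4):
    enc = make_encoder(seed)
    head = nn.Linear(64, 4)
    train_episode(enc, head, x_train, y_train)
    enc.requires_grad_(False)   # freeze: only the representation is reused
    encoders.append(enc)

# The "richer" representation is the concatenation of all episodes' features.
def rich_representation(x):
    return torch.cat([enc(x) for enc in encoders], dim=1)   # shape: (N, 4 * 64)

# For a new task or distribution, a linear probe trained on rich_representation(x)
# is compared against an equivalently sized network trained from scratch.
probe = nn.Linear(4 * 64, 4)
\end{verbatim}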