In many practical applications of machine learning, data arrives sequentially over time in large chunks. Practitioners then have to decide how to allocate their computational budget in order to obtain the best performance at any point in time. Online learning theory for convex optimization suggests that the best strategy is to use data as soon as it arrives. However, this might not be the best strategy when using deep non-linear networks, particularly when these perform multiple passes over each chunk of data, rendering the overall distribution non-i.i.d. In this paper, we formalize this learning setting in the simplest scenario, in which each data chunk is drawn from the same underlying distribution, and make a first attempt at empirically answering the following questions: How long should the learner wait before training on the newly arrived chunks? What architecture should the learner adopt? Should the learner increase capacity over time as more data is observed? We probe this learning setting using convolutional neural networks trained on classic computer vision benchmarks, as well as a large transformer model trained on a large-scale language modeling task. Code is available at \url{www.github.com/facebookresearch/ALMA}.
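As a minimal sketch of this chunked learning protocol (not the authors' implementation: the name \texttt{anytime\_learning\_loop} and its parameters are illustrative, and each chunk is assumed to be a PyTorch dataset of (input, label) pairs), the waiting time can be modeled as the number of chunks accumulated before each round of updates:

```python
import torch
import torch.nn as nn

def anytime_learning_loop(model, chunk_stream, wait=1, passes=1, lr=0.1):
    """Train on a stream of data chunks, waiting until `wait` chunks
    have accumulated before each round of gradient updates."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    buffer = []
    for chunk in chunk_stream:            # chunks arrive sequentially over time
        buffer.append(chunk)
        if len(buffer) < wait:            # not enough data yet: keep waiting
            continue
        data = torch.utils.data.ConcatDataset(buffer)
        loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)
        for _ in range(passes):           # multiple passes over the same chunk
            for x, y in loader:           # make the effective stream non-i.i.d.
                opt.zero_grad()
                loss = nn.functional.cross_entropy(model(x), y)
                loss.backward()
                opt.step()
        buffer.clear()                    # discard (or retain) old chunks,
                                          # depending on the variant studied
    return model
```

Under this sketch, a larger `wait` trades responsiveness (performance available at any point in time) for better-conditioned, more i.i.d.-like updates, which is the tension the questions above probe empirically.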