Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant. Challenges emerge with non-stationary training data streams, such as in continual learning. One powerful approach that has addressed this challenge involves pre-training large encoders on volumes of readily available data, followed by task-specific tuning. Given a new task, however, updating the weights of these encoders is challenging: a large number of weights need to be fine-tuned, and as a result, the models forget information about previous tasks. In the present work, we propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes. Our paradigm will be to encode; process the representation via a discrete bottleneck; and decode. Here, the input is fed to the pre-trained encoder, the output of the encoder is used to select the nearest keys, and the corresponding values are fed to the decoder to solve the current task. The model can only fetch and re-use a sparse number of these key-value pairs during inference, enabling localized and context-dependent model updates. We theoretically investigate the ability of the discrete key-value bottleneck to minimize the effect of learning under distribution shifts and show that it reduces the complexity of the hypothesis class. We empirically verify the proposed method under challenging class-incremental learning scenarios and show that the proposed model, without requiring any task boundaries, reduces catastrophic forgetting across a wide variety of pre-trained models, outperforming relevant baselines on this task.
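To make the encode, bottleneck, decode pipeline concrete, the following is a minimal PyTorch sketch of a single-codebook discrete key-value bottleneck. The class name, dimensions, and random key initialization here are illustrative assumptions rather than the paper's reference implementation (the full method, for instance, uses multiple codebooks operating on splits of the encoder representation).

```python
import torch
import torch.nn as nn

class DiscreteKeyValueBottleneck(nn.Module):
    """Minimal single-codebook sketch: snap encoder features to the
    nearest frozen key and return that key's learnable value."""

    def __init__(self, num_pairs: int, key_dim: int, value_dim: int):
        super().__init__()
        # Keys are fixed after initialization; only values receive gradients,
        # so updates stay localized to the pairs selected for each input.
        self.keys = nn.Parameter(torch.randn(num_pairs, key_dim),
                                 requires_grad=False)
        self.values = nn.Parameter(torch.randn(num_pairs, value_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, key_dim) features from a frozen pre-trained encoder.
        dists = torch.cdist(z, self.keys)  # (batch, num_pairs)
        idx = dists.argmin(dim=1)          # index of nearest key per input
        return self.values[idx]            # fetched values, fed to the decoder

# Usage: encoder output -> bottleneck -> task-specific decoder.
encoder_out = torch.randn(8, 64)  # stand-in for pre-trained encoder features
bottleneck = DiscreteKeyValueBottleneck(num_pairs=512, key_dim=64, value_dim=32)
decoder = nn.Linear(32, 10)       # e.g. a 10-class classification head
logits = decoder(bottleneck(encoder_out))
```

Because gradients reach only the values selected for a given input, learning on new data updates a sparse, input-dependent subset of the bottleneck, which is what makes the model updates localized and context-dependent.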