Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.