Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities on many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as a short text prompt. However, it has also been shown that VL models remain brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this correction must be done using the private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show that it is challenging for many existing data-free CL strategies. We therefore propose a data-free method built on a new Adversarial Pseudo-Replay (APR) approach, which generates adversarial reminders of past tasks from past-task models. To use this method efficiently, we also propose a continual, parameter-efficient Layered-LoRA (LaLo) neural architecture that provides no-memory-cost access to all past models at train time. We show that this approach outperforms all data-free methods by as much as ~7%, while even matching some levels of experience-replay performance (prohibitive for applications where data privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL
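To make the parameter-efficient adapter idea concrete, below is a minimal, generic sketch of a LoRA-style layer: a frozen pretrained weight matrix augmented with a trainable low-rank update B @ A. This is only an illustration of the general low-rank adapter mechanism, not the paper's LaLo implementation; the class and parameter names (`LoRALayer`, `W`, `A`, `B`) are ours, and matrices are plain Python lists to keep the example self-contained.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def madd(a, b):
    """Element-wise sum of two same-shape matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class LoRALayer:
    """Frozen base weight W plus a trainable low-rank residual B @ A.

    W is (d_out x d_in); A is (r x d_in) and B is (d_out x r) with
    r << min(d_out, d_in), so only B and A would be trained per task.
    """

    def __init__(self, W, A, B):
        self.W = W  # frozen pretrained weight
        self.A = A  # trainable low-rank factor
        self.B = B  # trainable low-rank factor

    def effective_weight(self):
        # The adapted layer behaves as if its weight were W + B @ A.
        return madd(self.W, matmul(self.B, self.A))

    def forward(self, x):
        # x is a column vector represented as a list of 1-element rows.
        return matmul(self.effective_weight(), x)


# Tiny worked example: identity base weight, rank-1 update.
layer = LoRALayer(W=[[1, 0], [0, 1]], A=[[1, 1]], B=[[1], [0]])
print(layer.effective_weight())  # [[2, 1], [0, 1]]
print(layer.forward([[1], [1]]))  # [[3], [1]]
```

Because the base weight stays frozen and only the small factors change, stacking one such low-rank residual per task is cheap, which is the general property the layered architecture in the paper exploits.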