Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities on many zero-shot downstream tasks, achieving competitive results in recognizing objects defined by as little as a short text prompt. However, it has also been shown that VL models are still brittle at Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using the private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We therefore propose a data-free method built around a new Adversarial Pseudo-Replay (APR) approach, which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7%, while even matching some levels of experience replay (which is prohibitive for applications where data privacy must be preserved).
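As a rough illustration of the Layered-LoRA idea sketched above, the following minimal PyTorch snippet shows how stacking one frozen low-rank adapter per task over a frozen base weight lets any earlier task's model be recovered at train time simply by applying fewer adapters, without storing separate model copies. This is a conceptual sketch under our own assumptions, not the paper's implementation; the class name `LayeredLoRALinear` and the `up_to_task` argument are hypothetical.

```python
import torch
import torch.nn as nn


class LayeredLoRALinear(nn.Module):
    """Illustrative sketch: a linear layer with one LoRA adapter stacked per task."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # frozen pre-trained weight
        self.loras = nn.ModuleList()            # one low-rank adapter per task
        self.rank = rank

    def add_task_adapter(self):
        """Freeze existing adapters and append a trainable one for the new task."""
        for p in self.loras.parameters():
            p.requires_grad_(False)
        a = nn.Linear(self.base.in_features, self.rank, bias=False)
        b = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(b.weight)                # new adapter starts as a no-op
        self.loras.append(nn.Sequential(a, b))

    def forward(self, x, up_to_task=None):
        """`up_to_task=k` evaluates the model as it was after task k, at no extra memory cost."""
        active = self.loras if up_to_task is None else self.loras[:up_to_task]
        out = self.base(x)
        for lora in active:
            out = out + lora(x)
        return out
```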