Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still settings where their zero-shot performance is far from optimal. We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that interpolates between the weights of a model before fine-tuning and its weights after fine-tuning on the task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks, and its benefits grow with model scale. Furthermore, we identify cases of broad transfer, where patching on one task increases accuracy on other tasks even when the tasks have disjoint classes. Finally, we investigate applications beyond common benchmarks, such as counting objects or reducing the impact of typographic attacks on CLIP. Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
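The core operation described above is a weight-space interpolation, theta_patched = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned. The following is a minimal PyTorch sketch of that operation, assuming both checkpoints are state dicts of the same architecture; the checkpoint filenames and the value of alpha are illustrative placeholders, and in practice the mixing coefficient would be chosen on held-out data.

```python
import torch

def interpolate_weights(zeroshot_sd, finetuned_sd, alpha):
    """Linearly interpolate between two checkpoints of the same architecture.

    alpha = 0 recovers the zero-shot model; alpha = 1 recovers the
    fine-tuned model. Intermediate values trade off accuracy on the
    patched task against accuracy on tasks the zero-shot model
    already handles well.
    """
    return {
        key: (1 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
        for key in zeroshot_sd
    }

# Illustrative usage: the paths and alpha below are placeholders,
# not artifacts from the paper.
zeroshot_sd = torch.load("clip_zeroshot.pt")
finetuned_sd = torch.load("clip_finetuned_task.pt")
patched_sd = interpolate_weights(zeroshot_sd, finetuned_sd, alpha=0.5)
```

Because the interpolation acts element-wise on the weights, the patched model has the same architecture and inference cost as the original zero-shot model.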