CATCH：一种模块化的跨域自适应模板与钩子机制 (CATCH: A Modular Cross-domain Adaptive Template with Hook)

Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.

翻译：视觉问答（VQA）领域的最新进展在自然图像域中展示了令人印象深刻的性能，诸如LLaVA等模型利用大型语言模型（LLM）进行开放式推理。然而，当迁移到遥感、医学影像或数学图表等域外场景时，由于显著的分布偏移和缺乏有效的域适应机制，其泛化能力会大幅下降。现有方法通常依赖于针对每个域的微调或定制化流程，这些方法成本高昂、灵活性差，且难以跨不同任务扩展。本文提出CATCH，一种即插即用的跨域适应框架，旨在提升VQA模型的泛化能力，同时对其核心架构仅需最小改动。我们的核心思想是通过引入两个轻量级模块来解耦视觉与语言适应：一个用于识别输入图像类型的域分类器，以及一个包含用于语言调制的提示适配器和用于视觉特征调整的视觉适配器的双重适配机制。这两个模块通过统一的钩子接口动态注入，无需对骨干模型进行重新训练。在四个特定领域的VQA基准测试上的实验结果表明，我们的框架在不重新训练骨干模型的情况下实现了持续的性能提升，包括在MathVQA上提升+2.3 BLEU，在MedVQA-RAD上提升+2.6 VQA分数，在ChartQA上提升+3.1 ROUGE分数。这些结果突显了CATCH为多域VQA提供了一种可扩展且可扩展的方法，使得在不同应用领域的实际部署成为可能。

相关内容

MoDELS

关注 44

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日