In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing modalities occur either during training or testing in real-world situations; and 2) when computational resources are insufficient to finetune heavy transformer models. To this end, we propose to utilize prompt learning to mitigate these two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while requiring less than 1% of the learnable parameters needed to train the entire model. We further explore the effect of different prompt configurations and analyze robustness to missing modalities. Extensive experiments demonstrate the effectiveness of our prompt learning framework, which improves performance under various missing-modality cases while alleviating the need for heavy model re-training. Code is available.
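The core idea of modality-missing-aware prompts can be illustrated with a minimal sketch: a small set of learnable prompt vectors, one per missing-modality case, is prepended to the input token embeddings before they enter a frozen multimodal transformer, so only the prompts (a tiny fraction of the model's parameters) are trained. The names, dimensions, and case labels below are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

# Toy sketch (assumed names, not the paper's implementation):
# one learnable prompt per missing-modality case; the backbone
# transformer stays frozen and only these prompts are updated.
D = 8            # embedding dimension (toy value)
PROMPT_LEN = 2   # number of prompt tokens per case

rng = np.random.default_rng(0)
prompts = {
    "complete":      rng.standard_normal((PROMPT_LEN, D)),
    "missing_text":  rng.standard_normal((PROMPT_LEN, D)),
    "missing_image": rng.standard_normal((PROMPT_LEN, D)),
}

def prepend_prompt(tokens: np.ndarray, case: str) -> np.ndarray:
    """Prepend the case-specific prompt to the input token embeddings.

    `tokens` has shape (seq_len, D); the output has shape
    (PROMPT_LEN + seq_len, D) and would be fed to the frozen transformer.
    """
    return np.concatenate([prompts[case], tokens], axis=0)

# Example: an image-only input (text modality missing).
tokens = rng.standard_normal((5, D))
seq = prepend_prompt(tokens, "missing_text")
print(seq.shape)  # (7, 8)
```

Because only the entries of `prompts` receive gradients, the number of trainable parameters is `num_cases * PROMPT_LEN * D`, which stays far below 1% of a full transformer's parameter count.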