Vision-and-Language (VL) pre-training has shown great potential on many related downstream tasks, such as Visual Question Answering (VQA), one of the most popular problems in the VL field. All of these pre-trained models (such as VisualBERT, ViLBERT, LXMERT, and UNITER) are built on the Transformer, which extends the classical attention mechanism to multiple layers and heads. To investigate why and how these models work so well on VQA, in this paper we explore the roles of individual heads and layers in Transformer models when handling $12$ different types of questions. Specifically, we manually remove (chop) one head (or layer) at a time from a pre-trained VisualBERT model and test it on questions of different levels to record its performance. As shown by the interesting echelon shape of the resulting matrices, the experiments reveal that different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual-reasoning questions. Based on this observation, we design a dynamic chopping module that automatically removes heads and layers of VisualBERT at the instance level when dealing with different questions. Our dynamic chopping module can effectively reduce the parameters of the original model by 50%, while degrading accuracy by less than 1% on the VQA task.
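To make the head-chopping procedure concrete, below is a minimal PyTorch sketch of zeroing out individual attention heads in a single self-attention layer. The class name, dimensions, and mask convention are illustrative assumptions for exposition; this is not the authors' implementation or the actual VisualBERT code, which prunes heads and layers of the full pre-trained model.

```python
import torch
import torch.nn as nn

class ChoppableMultiHeadAttention(nn.Module):
    """Multi-head self-attention whose individual heads can be chopped
    (zeroed out) via a binary mask. Hypothetical sketch, not the paper's code."""

    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, head_mask=None):
        # x: (batch, seq_len, d_model); head_mask: (num_heads,) of 0/1 entries
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # reshape to (batch, heads, seq, d_head)
            return t.view(b, n, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, heads, seq, d_head)
        if head_mask is not None:
            # chop: remove the contribution of masked-out heads
            ctx = ctx * head_mask.view(1, self.num_heads, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)

# Chop head 3 of one layer and run a forward pass, mirroring the
# one-head-at-a-time probing described above.
layer = ChoppableMultiHeadAttention()
x = torch.randn(2, 16, 768)
mask = torch.ones(12)
mask[3] = 0.0
y = layer(x, head_mask=mask)
```

Removing an entire layer follows the same idea at a coarser granularity: the layer's output is replaced by its input (an identity skip), so the rest of the network still receives tensors of the expected shape.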