We initiate the first empirical study on the use of MLP architectures for vision-and-language (VL) fusion. Through extensive experiments on 5 VL tasks and 5 robust VQA benchmarks, we find that: (i) without pre-training, using MLPs for multimodal fusion shows a noticeable performance gap compared with transformers; (ii) however, VL pre-training can help close this gap; (iii) instead of heavy multi-head attention, adding a tiny single-head attention to MLPs is sufficient to achieve performance comparable to transformers. Moreover, we find that the performance gap between MLPs and transformers does not widen when evaluated on the harder robust VQA benchmarks, suggesting that MLP-based VL fusion generalizes to roughly the same degree as transformer-based fusion. These results hint that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention. Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both the VL fusion module and the vision encoder are replaced with MLPs? Our results show that an all-MLP VL model is sub-optimal compared with state-of-the-art full-featured VL models when both are pre-trained. However, a pre-trained all-MLP model can, surprisingly, achieve a higher average score than full-featured transformer models without pre-training. This indicates the potential of large-scale pre-training of MLP-like architectures for VL modeling and points to a future research direction of simplifying well-established VL modeling with less inductive design bias. Our code is publicly available at: https://github.com/easonnie/mlp-vil
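To make finding (iii) concrete, below is a minimal sketch of a Mixer-style MLP fusion block augmented with a tiny single-head attention, written in PyTorch. The class name `MLPFusionBlock`, the hidden sizes, and the token layout are illustrative assumptions for exposition, not the actual implementation from the repository above.

```python
import torch
import torch.nn as nn


class MLPFusionBlock(nn.Module):
    """Illustrative sketch: token-mixing and channel-mixing MLPs over the
    concatenated vision and text tokens, plus a tiny single-head attention
    branch (hyperparameters are assumptions, not the paper's exact config)."""

    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        # Token-mixing MLP: mixes information across the token dimension.
        self.token_norm = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        # Channel-mixing MLP: mixes information across the feature dimension.
        self.channel_norm = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Tiny one-head attention added on top of the MLP mixing.
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim), tokens = [vision tokens ; text tokens]
        y = self.token_norm(x).transpose(1, 2)           # (batch, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)        # token mixing
        x = x + self.channel_mlp(self.channel_norm(x))   # channel mixing
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # single-head attention
        return x


# Usage: fuse 196 image-region tokens and 20 text tokens of width 768.
fusion = MLPFusionBlock(num_tokens=196 + 20, dim=768)
tokens = torch.randn(2, 216, 768)
out = fusion(tokens)  # (2, 216, 768)
```

The design choice mirrored here is that the residual MLP mixing does most of the cross-modal fusion, while the single-head attention is a lightweight addition rather than the stack of multi-head layers used in a standard transformer fusion module.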