In this paper, inspired by the success of vision-language pre-trained models and the benefits of training with adversarial attacks, we present a novel transformer-based cross-modal fusion model that incorporates both notions for the VQA Challenge 2021. Specifically, the proposed model is built on top of the VinVL architecture [19], and an adversarial training strategy [4] is applied to make the model more robust and generalizable. Moreover, two implementation tricks are used in our system to obtain better results. The experiments demonstrate that the proposed framework achieves 76.72% accuracy on the VQAv2 test-std set.
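The abstract does not spell out how the adversarial training strategy [4] is applied; a common realization in vision-language pre-training is to perturb the input embedding space with PGD-style noise and add the resulting loss to the clean loss. The snippet below is a minimal, generic sketch of that idea, assuming a hypothetical `model` wrapper that maps fused embeddings directly to answer logits; the loss choice and hyperparameters are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, embeds, labels, eps=1e-2, alpha=1e-3, steps=3):
    """PGD-style adversarial loss on input embeddings (illustrative sketch).

    `model` is assumed to be a wrapper around a VinVL-style fusion encoder
    that takes embeddings and returns answer logits; this is a hypothetical
    interface, not the paper's actual API.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        logits = model(embeds + delta)
        loss = F.binary_cross_entropy_with_logits(logits, labels)
        (grad,) = torch.autograd.grad(loss, delta)
        # Gradient ascent on the perturbation, then project back into the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Loss on the perturbed embeddings; the caller would add this to the clean loss.
    return F.binary_cross_entropy_with_logits(model(embeds + delta), labels)
```

In use, the total training objective would be the clean VQA loss plus this adversarial term (possibly weighted), which is the general pattern behind embedding-space adversarial training for robustness.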