We present a novel attention mechanism, Causal Attention (CATT), to remove the ever-elusive confounding effect in existing attention-based vision-language models. This effect causes harmful bias that misleads the attention module to focus on spurious correlations in the training data, harming model generalization. As the confounder is unobserved in general, we use the front-door adjustment to realize the causal intervention, which does not require any knowledge of the confounder. Specifically, CATT is implemented as a combination of 1) In-Sample Attention (IS-ATT) and 2) Cross-Sample Attention (CS-ATT), where the latter forcibly brings other samples into every IS-ATT, mimicking the causal intervention. CATT abides by the Q-K-V convention and hence can replace any attention module, such as top-down attention and the self-attention in Transformers. CATT improves various popular attention-based vision-language models by considerable margins. In particular, we show that CATT has great potential in large-scale pre-training, e.g., it can boost the lighter LXMERT~\cite{tan2019lxmert}, which uses fewer data and less computational power, to be comparable to the heavier UNITER~\cite{chen2020uniter}. Code is published at \url{https://github.com/yangxuntu/catt}.
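The combination of IS-ATT and CS-ATT described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes identity Q-K-V projections, a pre-computed feature bank of other samples as the cross-sample dictionary (the names `causal_attention` and `sample_bank` are hypothetical), and concatenation as the way of combining the two attention outputs; the actual CATT uses learned projections inside full vision-language architectures.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(Q, K, V):
    """Standard scaled dot-product attention (the Q-K-V convention)."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def causal_attention(x, sample_bank):
    """Hypothetical sketch of CATT = IS-ATT combined with CS-ATT.

    x:           (n, d) features of the current sample
    sample_bank: (m, d) features drawn from OTHER training samples
    """
    # IS-ATT: keys/values come from the current sample itself.
    is_out = qkv_attention(x, x, x)
    # CS-ATT: keys/values come from other samples, which forcibly
    # brings them into the attention (mimicking the intervention).
    cs_out = qkv_attention(x, sample_bank, sample_bank)
    # Combine the two outputs (concatenation, as one simple choice).
    return np.concatenate([is_out, cs_out], axis=-1)

# Usage: 5 region features of dim 64, a bank of 100 cross-sample features.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 64))
bank = rng.standard_normal((100, 64))
out = causal_attention(x, bank)  # shape (5, 128)
```

Because both branches follow the same Q-K-V interface, the combined module is a drop-in replacement for an ordinary attention layer, which is what allows CATT to slot into top-down attention or Transformer self-attention.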