We design a novel global-local Transformer named \textbf{Ada-ClustFormer} (\textbf{ACF}) to generate captions. We use this name because each layer of ACF can adaptively cluster input elements to carry out self-attention (Self-ATT) for learning local context. Compared with other global-local Transformers that carry out Self-ATT in fixed-size windows, ACF can capture structures of varying granularity, \eg, an object may cover different numbers of grids, or a phrase may contain diverse numbers of words. To build ACF, we insert a probabilistic matrix $\mathbf{C}$ into the Self-ATT layer. For an input sequence $\{s_1, \dots, s_N\}$, $\mathbf{C}_{i,j}$ softly determines whether the sub-sequence $\{s_i, \dots, s_j\}$ should be clustered for carrying out Self-ATT. In the implementation, $\mathbf{C}_{i,j}$ is calculated from the contexts of $\{s_i, \dots, s_j\}$, so ACF can exploit the input itself to decide which local contexts should be learned. By using ACF to build both the vision encoder and the language decoder, the captioning model can automatically discover the hidden structures in vision and language, which encourages the model to learn a unified structural space for transferring more structural commonalities. Experimental results demonstrate the effectiveness of ACF: it achieves a CIDEr score of 137.8, outperforming most SOTA captioning models and achieving scores comparable to those of some BERT-based models. The code will be available in the supplementary material.
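To make the mechanism concrete, below is a minimal PyTorch sketch of a single-head Self-ATT layer gated by such a clustering matrix. The span scorer, the mean-pooled span context used to compute $\mathbf{C}_{i,j}$, and the way $\mathbf{C}$ reweights the attention weights are all illustrative assumptions for this sketch, not the paper's exact parameterization:

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaClusterSelfAttention(nn.Module):
    """Single-head Self-ATT modulated by a probabilistic clustering
    matrix C (illustrative sketch; not the paper's exact design)."""

    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Hypothetical scorer: maps the pooled context of a span
        # {s_i, ..., s_j} to the probability that it is one cluster.
        self.span_scorer = nn.Linear(d_model, 1)
        self.scale = d_model ** -0.5

    def cluster_matrix(self, x):
        # x: (N, d). Compute C[i, j] for i <= j from the mean of the
        # span s_i..s_j via prefix sums, then symmetrize.
        N, d = x.shape
        csum = torch.cat([x.new_zeros(1, d), x.cumsum(dim=0)], dim=0)
        i = torch.arange(N, device=x.device).view(N, 1)
        j = torch.arange(N, device=x.device).view(1, N)
        length = (j - i + 1).clamp(min=1).float().unsqueeze(-1)
        span_mean = (csum[j + 1] - csum[i]) / length      # (N, N, d)
        C = torch.sigmoid(self.span_scorer(span_mean)).squeeze(-1)
        upper = torch.ones(N, N, device=x.device).triu().bool()
        return torch.where(upper, C, C.transpose(0, 1))   # symmetric

    def forward(self, x):
        # x: (N, d_model); global Self-ATT, softly gated by C.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax((q @ k.transpose(0, 1)) * self.scale, dim=-1)
        attn = attn * self.cluster_matrix(x)              # soft clusters
        attn = attn / attn.sum(-1, keepdim=True).clamp(min=1e-6)
        return attn @ v
\end{verbatim}

Because $\mathbf{C}$ is re-estimated from each layer's own inputs, stacking such layers yields the per-layer adaptive clustering described above; this is the sense in which the sketch mirrors ACF, rather than reproducing its exact formulation.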