Vision Transformer (ViT) extracts its final representation from either the class token or an average of all patch tokens, following the Transformer architecture from Natural Language Processing (NLP) and Convolutional Neural Networks (CNNs) in computer vision, respectively. However, studies on the best way to aggregate patch tokens remain limited to average pooling, even though widely used pooling strategies, such as max and GeM pooling, could be considered. Despite their effectiveness, the existing pooling strategies do not account for the architecture of ViT or the channel-wise differences in the activation maps, aggregating crucial and trivial channels with equal importance. In this paper, we present Group Generalized Mean (GGeM) pooling, a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. Because ViT groups channels via its multi-head attention mechanism, grouping the channels in GGeM lowers head-wise dependence while amplifying important channels in the activation maps. Exploiting GGeM yields 0.1%p to 0.7%p performance boosts over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.
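To make the pooling operation concrete, the following is a minimal NumPy sketch of group-wise GeM pooling as described above, not the authors' implementation: channels are split into groups, and each group shares one pooling exponent `p_g`. The function name, argument layout, and the `eps` clamp (GeM requires positive inputs) are illustrative assumptions.

```python
import numpy as np

def ggem_pool(tokens, p, num_groups, eps=1e-6):
    """Group Generalized Mean (GGeM) pooling sketch.

    tokens: (N, D) array of patch-token activations (N tokens, D channels).
    p: (num_groups,) pooling exponents, one shared per channel group.
    Returns a (D,) pooled representation.
    """
    n, d = tokens.shape
    assert d % num_groups == 0, "channels must split evenly into groups"
    x = np.clip(tokens, eps, None)  # clamp to positive, as GeM requires
    x = x.reshape(n, num_groups, d // num_groups)
    # mean over tokens of x**p_g, then the 1/p_g root, per group
    pooled = np.mean(x ** p[None, :, None], axis=0) ** (1.0 / p[:, None])
    return pooled.reshape(d)
```

With all exponents equal to 1 this reduces to average pooling, and as an exponent grows large the corresponding group approaches max pooling, so GGeM interpolates between the two per channel group.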