Transformer models are not only successful in natural language processing (NLP) but also demonstrate high potential in computer vision (CV). Despite great advances, most works focus only on improving architectures and pay little attention to the classification head. For years, transformer models have relied exclusively on the classification token to construct the final classifier, without explicitly harnessing high-level word tokens. In this paper, we propose a novel transformer model called second-order transformer (SoT), which simultaneously exploits the classification token and word tokens for the classifier. Specifically, we empirically show that high-level word tokens contain rich information, which by itself is highly competent for the classifier and, moreover, is complementary to the classification token. To effectively harness this rich information, we propose multi-headed global cross-covariance pooling with singular value power normalization, which shares a similar philosophy with, and thus is more compatible with, the transformer block than commonly used pooling methods. We then study comprehensively how to explicitly combine word tokens with the classification token to build the final classification head. For CV tasks, our SoT significantly improves state-of-the-art vision transformers on challenging benchmarks including ImageNet and ImageNet-A. For NLP tasks, through fine-tuning of pretrained language transformers including GPT and BERT, our SoT greatly boosts performance on widely used tasks such as CoLA and RTE. Code will be available at https://peihuali.org/SoT
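To make the idea of second-order pooling over word tokens concrete, the following is a minimal sketch (not the authors' released code) of multi-headed covariance pooling with singular value power normalization; the head count, power exponent, and per-head covariance formulation are illustrative assumptions, and the paper's cross-covariance variant may differ in detail.

```python
# Hedged sketch: pool high-level word tokens with per-head second-order statistics,
# then normalize by raising singular values to a power (assumed 0.5 here).
import torch

def multihead_cov_pool(tokens: torch.Tensor, num_heads: int = 4, power: float = 0.5) -> torch.Tensor:
    """tokens: (B, N, D) word tokens from the last transformer block."""
    B, N, D = tokens.shape
    assert D % num_heads == 0
    d = D // num_heads
    x = tokens.reshape(B, N, num_heads, d).permute(0, 2, 1, 3)    # (B, H, N, d)
    x = x - x.mean(dim=2, keepdim=True)                           # center per head
    cov = torch.matmul(x.transpose(-2, -1), x) / (N - 1)          # (B, H, d, d) covariance
    # singular value power normalization: rescale singular values by sigma**power
    u, s, vh = torch.linalg.svd(cov)
    cov_pn = u @ torch.diag_embed(s.clamp(min=1e-10) ** power) @ vh
    return cov_pn.flatten(1)                                      # (B, H*d*d) pooled feature

# Usage: pool 196 word tokens of width 384; the result can be concatenated with the
# classification token before the final classifier, as the paper proposes.
feat = multihead_cov_pool(torch.randn(2, 196, 384), num_heads=4)
```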