Modeling visual data as tokens (i.e., image patches) and applying attention mechanisms or feed-forward networks on top of them has proven highly effective in recent years. Such approaches typically follow a common pipeline: a tokenization method, followed by a set of layers/blocks that mix information both within and among tokens. In common practice, image patches are flattened when converted into tokens, which discards the spatial structure within each patch. A module such as multi-head self-attention then captures the pairwise relations among the tokens and mixes them. In this paper, we argue that models can achieve significant gains when spatial structure is preserved during tokenization and explicitly used in the mixing stage. We make two key contributions: (1) Structure-aware Tokenization and (2) Structure-aware Mixing, both of which can be combined with existing models with minimal effort. We introduce a family of models (SWAT) that shows improvements over the likes of DeiT, MLP-Mixer, and Swin Transformer across multiple benchmarks, including ImageNet classification and ADE20K segmentation. Our code and models will be released online.
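To make the contrast between flattened and structure-preserving tokens concrete, the following is a minimal sketch on a toy single-channel image. It is not the SWAT implementation: the abstract does not specify how Structure-aware Tokenization is realized, and the function names (`split_into_patches`, `flat_tokens`, `structured_tokens`) and the 4x4 example are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: names and shapes are assumptions, not the SWAT code.

def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch blocks.

    Returns the patches in row-major order; each patch is itself a list of rows.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def flat_tokens(image, patch):
    """Conventional tokenization: each patch becomes a 1-D vector of patch*patch values."""
    return [[pixel for row in p for pixel in row]
            for p in split_into_patches(image, patch)]


def structured_tokens(image, patch):
    """Structure-preserving tokenization (assumed form): each token keeps its 2-D layout."""
    return split_into_patches(image, patch)


# A 4x4 image with pixel values 0..15, split into four 2x2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
flat = flat_tokens(image, 2)
structured = structured_tokens(image, 2)

# In a flat token, vertical adjacency is only implicit in the index arithmetic;
# in a structured token, the pixel below (r, c) is simply token[r + 1][c].
pixel_below_origin = structured[0][1][0]
```

In the flat form, a mixing layer sees only a vector and has to relearn which entries were spatial neighbors. In the structured form, that layout remains available to operations that act within a token, which is the property the abstract argues should be exploited in the mixing stage.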