Existing transformer-based image backbones typically propagate feature information in one direction, from lower to higher levels. This may not be ideal, since the localization ability to delineate accurate object boundaries is most prominent in the lower, high-resolution feature maps, while the semantics that disambiguate image signals belonging to one object versus another typically emerge at higher levels of processing. We present Hierarchical Inter-Level Attention (HILA), an attention-based method that captures Bottom-Up and Top-Down Updates between features of different levels. HILA extends hierarchical vision transformer architectures by adding local connections between higher- and lower-level features to the backbone encoder. In each iteration, we construct a hierarchy by having higher-level features compete for assignments to update lower-level features belonging to them, iteratively resolving object-part relationships. These improved lower-level features are then used to re-update the higher-level features. HILA can be integrated into most hierarchical architectures without requiring any changes to the base model. We add HILA to SegFormer and the Swin Transformer and show notable improvements in semantic segmentation accuracy with fewer parameters and FLOPs. Project website and code: https://www.cs.toronto.edu/~garyleung/hila/
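To make the update scheme concrete, below is a minimal PyTorch sketch of one top-down/bottom-up iteration between two feature levels. This is an illustration under simplifying assumptions, not the paper's implementation: the module name `InterLevelAttention`, the projection names (`q_lo`, `kv_hi`, `q_hi`, `kv_lo`), a shared channel width across levels, single-head attention, non-overlapping 2×2 child patches, and a 3×3 neighborhood of candidate parents are all choices made here for brevity.

```python
# Minimal, hypothetical sketch of a HILA-style inter-level update in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InterLevelAttention(nn.Module):
    """Single-head sketch of top-down and bottom-up inter-level updates.

    Simplifying assumptions (not the paper's configuration): both levels
    share channel width `dim`, each higher-level token owns a non-overlapping
    `patch` x `patch` block of lower-level tokens, and each lower-level token
    chooses among a 3x3 neighborhood of candidate parents.
    """

    def __init__(self, dim, patch=2):
        super().__init__()
        self.patch = patch
        self.scale = dim ** -0.5
        # top-down pass: lower-level queries, higher-level keys/values
        self.q_lo = nn.Linear(dim, dim)
        self.kv_hi = nn.Linear(dim, 2 * dim)
        # bottom-up pass: higher-level queries, lower-level keys/values
        self.q_hi = nn.Linear(dim, dim)
        self.kv_lo = nn.Linear(dim, 2 * dim)

    def forward(self, lo, hi):
        # lo: (B, C, H, W) lower level; hi: (B, C, H/p, W/p) higher level
        B, C, H, W = lo.shape
        p, Nh = self.patch, hi.shape[2] * hi.shape[3]

        # group each parent's p*p children: (B, Nh, p*p, C)
        lo_p = F.unfold(lo, p, stride=p).view(B, C, p * p, Nh).permute(0, 3, 2, 1)
        # 3x3 neighborhood of candidate parents per block: (B, Nh, 9, C)
        hi_n = F.unfold(hi, 3, padding=1).view(B, C, 9, Nh).permute(0, 3, 2, 1)
        hi_t = hi.flatten(2).transpose(1, 2)             # (B, Nh, C)

        # top-down: candidate parents compete (softmax over the 9 parents)
        # for the assignment of the lower-level features they update
        q = self.q_lo(lo_p)                              # (B, Nh, p*p, C)
        k, v = self.kv_hi(hi_n).chunk(2, dim=-1)         # (B, Nh, 9, C) each
        a = (q @ k.transpose(-2, -1)) * self.scale       # (B, Nh, p*p, 9)
        lo_p = lo_p + a.softmax(dim=-1) @ v

        # bottom-up: each parent re-aggregates its improved children
        q = self.q_hi(hi_t).unsqueeze(2)                 # (B, Nh, 1, C)
        k, v = self.kv_lo(lo_p).chunk(2, dim=-1)         # (B, Nh, p*p, C) each
        a = (q @ k.transpose(-2, -1)) * self.scale       # (B, Nh, 1, p*p)
        hi_t = hi_t + (a.softmax(dim=-1) @ v).squeeze(2)

        # restore spatial layouts
        lo_out = F.fold(lo_p.permute(0, 3, 2, 1).reshape(B, C * p * p, Nh),
                        (H, W), p, stride=p)
        hi_out = hi_t.transpose(1, 2).reshape(hi.shape)
        return lo_out, hi_out


# Usage: shapes are preserved, so the module can sit between encoder stages.
lo, hi = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 16, 16)
lo2, hi2 = InterLevelAttention(dim=64, patch=2)(lo, hi)
```

Because input and output shapes match, a block like this can be inserted between the stages of an existing hierarchical backbone, which mirrors the abstract's claim that HILA integrates without changes to the base model.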