We present the High-Resolution Transformer (HRT), which learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations and incurs high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), together with local-window self-attention, which performs self-attention over small non-overlapping image windows, to improve memory and computation efficiency. In addition, we introduce a convolution into the feed-forward network (FFN) to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation; e.g., HRT outperforms the Swin Transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters and $30\%$ fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
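To illustrate the conv-in-FFN idea mentioned above, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes the exchanged information comes from a 3x3 depth-wise convolution placed between the two point-wise FFN projections, and all class and argument names here are hypothetical.

```python
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    """Sketch of an FFN whose hidden features pass through a 3x3 depth-wise
    convolution, letting tokens from adjacent non-overlapping windows
    exchange information (assumed reading of "a convolution into the FFN")."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)          # point-wise expansion
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)          # depth-wise, crosses window borders
        self.fc2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)          # point-wise projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map reassembled from the local windows
        x = self.act(self.fc1(x))
        x = self.act(self.dwconv(x))
        return self.fc2(x)


if __name__ == "__main__":
    ffn = ConvFFN(dim=64, hidden_dim=256)
    out = ffn(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the depth-wise convolution slides across the full feature map rather than within each window, it is a cheap way to propagate information between windows without the cost of global self-attention.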