Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. This allows ResFormer to cope with novel resolutions effectively. We conduct extensive experiments for image classification on ImageNet. The results provide strong quantitative evidence that ResFormer has promising scaling abilities towards a wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1 accuracy of 75.86% and 81.72% when evaluated on relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. We also demonstrate, among other things, that ResFormer is flexible and can be easily extended to semantic segmentation and video action recognition.
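To make the training scheme concrete, the sketch below illustrates multi-resolution training with a scale consistency loss in PyTorch. It is a minimal illustration, not the authors' implementation: the function `multi_resolution_step`, the choice of resolutions, the loss weight, and the use of KL divergence against the detached highest-resolution prediction are all assumptions for exposition; ResFormer's actual loss formulation may differ.

```python
# A minimal sketch (not the authors' code) of multi-resolution training with a
# scale consistency loss. It assumes `model` is a ViT-style classifier that
# accepts variable input resolutions.
import torch
import torch.nn.functional as F

def multi_resolution_step(model, images, labels,
                          resolutions=(128, 160, 224),
                          consistency_weight=1.0):
    """One training step on replicas of `images` at several resolutions."""
    logits_per_scale = []
    for res in resolutions:
        # Replicate the batch at the current resolution.
        replicas = F.interpolate(images, size=(res, res),
                                 mode='bilinear', align_corners=False)
        logits_per_scale.append(model(replicas))

    # Supervised loss on every scale.
    loss = sum(F.cross_entropy(logits, labels) for logits in logits_per_scale)

    # Scale consistency loss (illustrative choice): align lower-resolution
    # predictions with the detached highest-resolution prediction via KL.
    teacher = F.softmax(logits_per_scale[-1].detach(), dim=-1)
    for logits in logits_per_scale[:-1]:
        loss = loss + consistency_weight * F.kl_div(
            F.log_softmax(logits, dim=-1), teacher, reduction='batchmean')
    return loss
```

Treating the highest-resolution prediction as a fixed target is one simple way to let coarser scales benefit from interactive information across scales; other symmetric or pairwise consistency terms would fit the same template.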