This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts a simple discrete VAE as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, the discrete tokens in the NLP field are naturally highly semantic. This difference motivates us to learn a perceptual codebook, and we find one simple yet effective idea: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings and subsequently help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3 under the same number of pre-training epochs. Our method also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20K by +1.0 mIoU. Equipped with a larger ViT-H backbone, we achieve state-of-the-art performance (88.3% Top-1 accuracy) among methods using only ImageNet-1K data. The code and models will be available at https://github.com/microsoft/PeCo.
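A minimal sketch of the core idea stated above: adding a perceptual (deep-feature) similarity term to the standard dVAE pixel reconstruction loss so that tokens are learned under a perceptual objective. The use of VGG-16 features, the chosen layer indices, and the loss weight `lam` are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models


class PerceptualLoss(torch.nn.Module):
    """Deep-feature similarity between an image and its dVAE reconstruction.

    VGG-16 features and layer indices are illustrative assumptions here;
    the paper may use a different feature extractor.
    """

    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, recon, target):
        loss, x, y = 0.0, recon, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                # Match intermediate feature maps, not just raw pixels.
                loss = loss + F.mse_loss(x, y)
        return loss


def dvae_loss(recon, target, perceptual_fn, lam=1.0):
    """Pixel reconstruction plus a perceptual term (lam is a hypothetical weight)."""
    return F.mse_loss(recon, target) + lam * perceptual_fn(recon, target)
```

During tokenizer training, `dvae_loss` would simply replace the plain pixel-reconstruction objective; the rest of the dVAE and the downstream BERT-style pre-training pipeline stay unchanged.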