This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision domain. It directly adopts a simple discrete VAE as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, the discrete tokens in NLP are naturally highly semantic. This difference motivates us to learn a perceptual codebook, and we find one simple yet effective idea: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings, and subsequently help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same number of pre-training epochs. Our method also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20K by +1.0 mIoU. The code and models will be available at \url{https://github.com/microsoft/PeCo}.
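As a rough illustration of the idea (a sketch, not the paper's exact formulation), the dVAE reconstruction objective can be augmented with a feature-level perceptual term, where $\phi_l(\cdot)$ denotes the activations of the $l$-th layer of a fixed feature extractor, $\hat{x}$ is the reconstruction, $\lambda$ balances the two terms, and $\mathcal{L}_{\text{codebook}}$ collects the usual vector-quantization terms:
\[
\mathcal{L} \;=\; \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{pixel reconstruction}}
\;+\; \lambda \sum_{l} \big\lVert \phi_l(x) - \phi_l(\hat{x}) \big\rVert_2^2
\;+\; \mathcal{L}_{\text{codebook}} .
\]
The perceptual term encourages the tokenizer to preserve feature-level (semantic) content rather than only per-pixel fidelity, which is the property the learned codebook is intended to capture.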