The self-supervised Masked Image Modeling (MIM) paradigm, which follows a "mask-and-reconstruct" pipeline of recovering content from a masked image, has recently attracted increasing interest in the multimedia community owing to its excellent ability to learn visual representations from unlabeled data. Aiming to learn representations with high-level semantic abstraction, one group of works attempts to reconstruct non-semantic pixels with a large-ratio masking strategy, which may suffer from an "over-smoothing" problem, while others directly infuse semantics into the reconstruction targets in an off-line way that requires extra data. Different from them, we shift the perspective to the Fourier domain, which naturally provides a global view, and present a new Masked Image Modeling method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training. Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both the pixel and the frequency space, where each branch serves as not only a complement to but also a reciprocal constraint on the other. In this way, more robust representations can be learned by the pre-trained encoder, whose effectiveness is confirmed by comparative experimental results on downstream recognition tasks. We also conduct several quantitative and qualitative experiments to investigate the learning behavior of our method. To the best of our knowledge, this is the first MIM work to approach visual pre-training through the lens of the frequency domain.
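The dual-space supervision described above can be illustrated with a minimal sketch. This is not the paper's implementation; the loss weights, the squared-error form of the frequency term, and the function names are all illustrative assumptions. It only shows the core idea: the same reconstruction is penalized both pointwise in pixel space and globally in Fourier space, where every spectral coefficient depends on all pixels.

```python
import numpy as np

def pixel_loss(pred, target):
    # Standard MIM-style reconstruction loss: mean-squared error in pixel space.
    return np.mean((pred - target) ** 2)

def frequency_loss(pred, target):
    # Compare 2D Fourier spectra. Each FFT coefficient aggregates every pixel,
    # which is the "global perspective" of the frequency domain.
    pred_f = np.fft.fft2(pred)
    target_f = np.fft.fft2(target)
    return np.mean(np.abs(pred_f - target_f) ** 2)

def geminated_loss(pred_pixel_decoder, pred_freq_decoder, target, alpha=0.5):
    # Hypothetical combination: the outputs of both decoders are supervised in
    # both spaces, so the two branches act as reciprocal constraints.
    # `alpha` is an assumed balancing weight, not a value from the paper.
    l_pix = pixel_loss(pred_pixel_decoder, target) + pixel_loss(pred_freq_decoder, target)
    l_freq = frequency_loss(pred_pixel_decoder, target) + frequency_loss(pred_freq_decoder, target)
    return l_pix + alpha * l_freq
```

A perfect reconstruction drives both terms to zero, while any deviation is penalized twice: locally per pixel and globally per spectral coefficient.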