Motivated by the fact that the forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit a network to an energy function that assigns low energy scores to samples belonging to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. As a result, our framework accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained with masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show that the proposed method delivers comparable or even better performance with remarkably fewer training epochs than state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling.
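To make the backward pass concrete, below is a minimal PyTorch-style sketch of restoring a corrupted input by gradient descent on the learned energy; the function name `energy_net`, the number of steps, and the step size are illustrative assumptions, not details taken from the paper.

```python
import torch

def restore(energy_net, x_corrupted, steps=8, step_size=0.1):
    """Iteratively restore a corrupted input by descending the energy landscape.

    `energy_net` maps a batch of images to one scalar energy per sample
    (low energy for in-distribution data). All hyperparameters here are
    placeholders chosen for illustration only.
    """
    x = x_corrupted.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_net(x).sum()            # total energy of the current batch
        grad, = torch.autograd.grad(energy, x)  # gradient of energy w.r.t. the input
        # Move the input a small step toward lower energy, i.e., toward the data manifold.
        x = (x - step_size * grad).detach().requires_grad_(True)
    return x.detach()
```

In the proposed framework, a loop of this form plays the role of a dedicated decoder: the same network that scores energy in the forward pass drives restoration through its input gradients in the backward pass.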