Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our "SimSiam" method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning. Code will be made available.
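To make the role of stop-gradient concrete, below is a minimal sketch of the symmetrized SimSiam loss in PyTorch-style Python. It assumes an encoder module `f` (backbone plus projection MLP) and a predictor MLP `h`; these names and the helper `negative_cosine` are illustrative and not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def negative_cosine(p, z):
    # Stop-gradient: the target branch z receives no gradient.
    z = z.detach()
    return -F.cosine_similarity(p, z, dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    # x1, x2 are two augmentations of the same batch of images,
    # passed through the shared encoder f.
    z1, z2 = f(x1), f(x2)
    p1, p2 = h(z1), h(z2)
    # Symmetrized loss: each prediction is matched to the other
    # view's representation under stop-gradient.
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

Removing the `detach()` call in this sketch corresponds to the collapsing setting discussed in the paper.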