What can neural networks learn about the visual world when provided with only a single image as input? A single image obviously cannot contain the multitude of all existing objects, scenes and lighting conditions. Yet within the space of all 256^(3x224x224) possible 224x224 images, it might still provide a strong prior for natural images. To analyze this `augmented image prior' hypothesis, we develop a simple framework for training neural networks from scratch on augmentations of a single image, using knowledge distillation from a supervised pretrained teacher. With this, we find the answer to the above question to be: `surprisingly, a lot'. In quantitative terms, we find accuracies of 94%/74% on CIFAR-10/100, 69% on ImageNet, and, by extending this method to video and audio, 51% on Kinetics-400 and 84% on SpeechCommands. In extensive analyses spanning 13 datasets, we disentangle the effects of augmentations, choice of data and network architectures, and also provide qualitative evaluations that include lucid `panda neurons' in networks that have never even seen one.
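As an illustration of the training signal described above, the following is a minimal sketch, not the authors' implementation. It assumes the commonly used temperature-scaled KL divergence between teacher and student outputs as the distillation loss, and a plain random crop as a stand-in for the augmentation pipeline. The function names, the temperature value and the choice of augmentation are all hypothetical; the abstract does not specify them.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the standard distillation loss, used here only
    for illustration. Note that no ground-truth label appears anywhere.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def random_crop(image, size, rng):
    """Cut a random size x size patch from a 2-D list-of-lists image.

    Assumption: a hypothetical stand-in for the augmentations; every
    training sample would be such a patch of the one source image.
    """
    height, width = len(image), len(image[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]
```

In a training loop under these assumptions, each step would draw patches from the single source image with `random_crop`, pass them through both networks, and update only the student to reduce `distillation_loss`. The loss is zero exactly when the two softened distributions coincide.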