Modern vision models typically rely on fine-tuning general-purpose models pre-trained on large, static datasets. These general-purpose models only capture the knowledge within their pre-training datasets, which are tiny, out-of-date snapshots of the Internet -- where billions of images are uploaded each day. We suggest an alternate approach: rather than hoping our static datasets transfer to our desired tasks after large-scale pre-training, we propose dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. Our approach, called Internet Explorer, explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It cycles between searching for images on the Internet with text queries, self-supervised training on downloaded images, determining which images were useful, and prioritizing what to search for next. We evaluate Internet Explorer across several datasets and show that it outperforms or matches CLIP oracle performance by using just a single GPU desktop to actively query the Internet for 30--40 hours. Results, visualizations, and videos at https://internet-explorer-ssl.github.io/
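The abstract describes a cycle of querying, downloading, self-supervised training, scoring usefulness, and re-prioritizing queries. Below is a minimal, illustrative Python sketch of that loop under stated assumptions; every function name (`search_images`, `ssl_train_step`, `relevance_reward`, `update_query_distribution`) and all values are hypothetical stand-ins, not the paper's actual API or results.

```python
# Illustrative sketch of an Internet-Explorer-style exploration loop.
# All helpers below are hypothetical placeholders, not the paper's implementation.
import random
from typing import Dict, List


def search_images(query: str, n: int = 100) -> List[str]:
    """Stand-in for a text-based web image search; returns image identifiers."""
    return [f"{query}_{i}.jpg" for i in range(n)]


def ssl_train_step(model_state: Dict, images: List[str]) -> Dict:
    """Stand-in for one round of self-supervised training on downloaded images."""
    model_state["steps"] = model_state.get("steps", 0) + len(images)
    return model_state


def relevance_reward(model_state: Dict, images: List[str], targets: List[str]) -> float:
    """Stand-in: score how useful the downloaded images were for the target dataset."""
    return random.random()


def update_query_distribution(scores: Dict[str, float], queries: List[str]) -> List[float]:
    """Prioritize queries whose past downloads scored highest (normalized weighting)."""
    total = sum(scores.get(q, 1.0) for q in queries)
    return [scores.get(q, 1.0) / total for q in queries]


# The cycle: pick a query -> download -> self-supervised training -> score -> re-prioritize.
queries = ["granny smith apple", "golden retriever", "fire engine"]  # hypothetical queries
target_images = ["target_0.jpg", "target_1.jpg"]                     # hypothetical target set
scores: Dict[str, float] = {}
model_state: Dict = {}
weights = [1.0 / len(queries)] * len(queries)

for iteration in range(5):
    query = random.choices(queries, weights=weights, k=1)[0]
    downloaded = search_images(query)
    model_state = ssl_train_step(model_state, downloaded)
    scores[query] = relevance_reward(model_state, downloaded, target_images)
    weights = update_query_distribution(scores, queries)
```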