Large-scale datasets are the cornerstone of representation learning. Existing self-supervised approaches extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. However, finding large amounts of data that satisfy such assumptions is not straightforward, which forces the community to rely on datasets collected through laborious annotation and/or manual filtering. In this paper, we propose a subset optimization approach for automatic dataset curation. Focusing on audio-visual representation learning, we find a subset that maximizes the mutual information between the audio and visual channels of videos. We show that self-supervised models trained on our dataset, despite it being automatically constructed, achieve downstream performance competitive with models trained on existing datasets that require annotation and/or manual filtering. The most significant benefit of our approach is scalability. We release a dataset of 100M videos with high audio-visual correspondence.
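The abstract does not specify the optimization procedure, but as a rough illustration of the stated idea, the following is a minimal sketch of greedy subset selection that maximizes the mutual information between audio and visual cluster assignments. It assumes each video has already been assigned a discrete audio cluster and visual cluster (e.g., by k-means on per-modality features); all names here (`mutual_information`, `greedy_subset`, `audio_labels`, `visual_labels`) are hypothetical, not from the paper.

```python
import numpy as np

def mutual_information(counts):
    """MI (in nats) of the joint audio/visual cluster distribution,
    given a contingency matrix of co-occurrence counts."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    pa = p.sum(axis=1, keepdims=True)   # marginal over audio clusters
    pv = p.sum(axis=0, keepdims=True)   # marginal over visual clusters
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p > 0, p / (pa * pv), 1.0)  # log(1) = 0 for empty cells
    return float((p * np.log(ratio)).sum())

def greedy_subset(audio_labels, visual_labels, k, n_audio, n_visual):
    """Greedily pick k videos whose audio/visual cluster assignments
    maximize mutual information over the selected subset."""
    n = len(audio_labels)
    selected = []
    counts = np.zeros((n_audio, n_visual))
    remaining = set(range(n))
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in remaining:
            # Tentatively add video i and score the resulting MI.
            counts[audio_labels[i], visual_labels[i]] += 1
            gain = mutual_information(counts)
            counts[audio_labels[i], visual_labels[i]] -= 1
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
        counts[audio_labels[best], visual_labels[best]] += 1
    return selected

# Toy usage with random cluster assignments for 500 candidate videos.
rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=500)   # audio cluster ids
v = rng.integers(0, 10, size=500)   # visual cluster ids
subset = greedy_subset(a, v, k=100, n_audio=10, n_visual=10)
```

At the scale of 100M candidate videos, this naive O(n) re-scoring per greedy step would be infeasible; a practical system would need batched selection or lazy evaluation of marginal gains, which this sketch omits.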