A large part of the current success of deep learning lies in the effectiveness of data, more precisely: labelled data. Yet labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where the current focus is on learning feature representations. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders, and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.