We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and comprises images from existing datasets together with brand-new, expert-curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments using state-of-the-art transformer methods. Our findings show that audiovisual fusion methods outperform methods that use exclusively image-based or audio-based input for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.