The masked autoencoder (MAE) is a simple yet powerful self-supervised learning method. However, it learns representations indirectly, by reconstructing masked input patches. Several methods learn representations directly by predicting the representations of masked patches; however, we think using all patches to encode the training-signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining the training signal from masked patches only. In M2D, the online network encodes visible patches and predicts the representations of masked patches, while the target network, a momentum encoder, encodes only the masked patches. To predict the target representations well, the online network must model the input well; likewise, the target network must model the input well for its outputs to agree with the online predictions. The learned representations should therefore model the input better. We validated M2D by learning general-purpose audio representations, and M2D set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2.
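To make the two-network scheme concrete, below is a minimal PyTorch sketch of one M2D training step. It is an illustration under stated assumptions, not the authors' reference implementation: `online_encoder`, `predictor`, and `target_encoder` are hypothetical module names, the mask ratio and the standardized-MSE loss are illustrative choices, and the target encoder is assumed to start as a copy of the online encoder and track it by exponential moving average.

```python
# Minimal sketch of one M2D training step (assumptions noted above).
import torch
import torch.nn.functional as F

def m2d_step(patches, online_encoder, predictor, target_encoder,
             optimizer, mask_ratio=0.7, ema=0.996):
    # patches: (batch, num_patches, dim), a flattened patch sequence.
    B, N, _ = patches.shape
    n_vis = int(N * (1 - mask_ratio))

    # Randomly split the patches of each sample into visible and masked sets.
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
    vis_idx, msk_idx = idx[:, :n_vis], idx[:, n_vis:]
    gather = lambda x, i: x.gather(1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    visible, masked = gather(patches, vis_idx), gather(patches, msk_idx)

    # Online network: encode visible patches only, then predict the
    # representations of the masked patches (predictor signature is assumed).
    z_vis = online_encoder(visible)
    pred = predictor(z_vis, msk_idx)  # one predicted vector per masked patch

    # Target network (momentum encoder): encode ONLY the masked patches,
    # so the training signal never uses the visible patches.
    with torch.no_grad():
        target = target_encoder(masked)

    # Agreement loss between online predictions and target representations
    # (standardized MSE here; the exact loss is an assumption).
    loss = F.mse_loss(F.layer_norm(pred, pred.shape[-1:]),
                      F.layer_norm(target, target.shape[-1:]))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # EMA update: the target encoder slowly tracks the online encoder.
    with torch.no_grad():
        for p_t, p_o in zip(target_encoder.parameters(),
                            online_encoder.parameters()):
            p_t.mul_(ema).add_(p_o, alpha=1 - ema)
    return loss.item()
```

The key design point the sketch highlights is the split: the online branch sees visible patches, the target branch sees masked patches, and neither input overlaps, so the target representations carry signal the online network cannot trivially copy.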