Existing audio analysis methods generally first transform the audio stream into a spectrogram and then feed it into a CNN for further analysis. A standard CNN recognizes specific visual patterns over the feature map and then pools them into a high-level representation, which overlooks the positional information of the recognized patterns. However, unlike a natural image, the semantics of an audio spectrogram are sensitive to positional changes, since its vertical and horizontal axes encode the frequency and temporal information of the audio rather than plain spatial coordinates. Thus, the insensitivity of CNNs to positional changes is detrimental to audio spectrogram encoding. To address this issue, this paper proposes a new self-supervised learning mechanism that enhances the audio representation by first generating adversarial samples (\textit{i.e.}, negative samples) and then driving the CNN to distinguish the embeddings of negative pairs in the latent space. Extensive experiments show that the proposed approach achieves the best or competitive results on 9 downstream datasets compared with previous methods, verifying its effectiveness for audio representation learning.
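The abstract only outlines the mechanism, so the following is a minimal PyTorch sketch of the general idea rather than the paper's actual method: a hypothetical positional perturbation (rolling the spectrogram along its frequency axis) plays the role of the adversarial/negative sample, and a margin-based loss pushes the embeddings of each negative pair apart. The encoder architecture, `make_negative`, and `negative_pair_loss` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecEncoder(nn.Module):
    """Toy CNN encoder mapping a (B, 1, F, T) spectrogram to a unit-norm embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling: position-agnostic, as discussed above
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

def make_negative(spec, max_shift=16):
    """Hypothetical negative sample: roll the spectrogram along the frequency axis,
    which changes its semantics while leaving the local visual patterns intact."""
    shift = int(torch.randint(1, max_shift + 1, (1,)))
    return torch.roll(spec, shifts=shift, dims=2)  # dim 2 = frequency bins

def negative_pair_loss(z_anchor, z_neg, margin=0.5):
    """Push the embeddings of each negative pair apart (hinge on cosine similarity)."""
    sim = (z_anchor * z_neg).sum(dim=-1)  # cosine similarity of unit-norm embeddings
    return F.relu(sim - margin).mean()

# Usage sketch on a batch of log-mel spectrograms (shapes are placeholders).
encoder = SpecEncoder()
spec = torch.randn(8, 1, 128, 256)
loss = negative_pair_loss(encoder(spec), encoder(make_negative(spec)))
loss.backward()
```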