Self-supervised representation learning follows a paradigm of withholding some part of the data and tasking the network to predict it from the remaining part. Towards this end, masking has emerged as a generic and powerful tool where content is withheld along the sequential dimension, e.g., spatial in images, temporal in audio, and syntactic in language. In this paper, we explore the orthogonal channel dimension for generic data augmentation. The data for each channel is quantized through a non-uniform quantizer, with the quantized value sampled randomly within randomly sampled quantization bins. From another perspective, quantization is analogous to channel-wise masking, as it removes the information within each bin but preserves the information across bins. We apply this randomized quantization in conjunction with sequential augmentations to self-supervised contrastive models. This generic approach achieves results on par with modality-specific augmentation on vision tasks, and state-of-the-art results on 3D point clouds as well as on audio. We also demonstrate that this method is applicable to augmenting intermediate embeddings within a deep neural network, evaluated on the comprehensive DABS benchmark, which comprises diverse data modalities. Code is available at http://www.github.com/microsoft/random_quantize.
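To make the augmentation concrete, below is a minimal PyTorch sketch of a randomized quantizer as described above: interior bin edges are sampled at random to form a non-uniform quantizer, and each bin's quantized value is itself drawn at random within the bin. The function name `randomized_quantize` and the `num_bins` parameter are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
import torch

def randomized_quantize(x: torch.Tensor, num_bins: int = 8) -> torch.Tensor:
    """Randomly quantize one channel (illustrative sketch, not the official code).

    Interior bin edges are sampled uniformly at random between the channel's
    min and max, yielding a non-uniform quantizer; every value in a bin is
    then replaced by a single value drawn uniformly within that bin. This
    removes intra-bin information while preserving inter-bin structure.
    """
    lo, hi = x.min(), x.max()
    # Randomly sampled, sorted interior edges -> non-uniform bins.
    interior = torch.sort(torch.rand(num_bins - 1, device=x.device)).values
    edges = torch.cat([lo.view(1), lo + (hi - lo) * interior, hi.view(1)])
    # One representative value per bin, drawn uniformly within the bin.
    reps = edges[:-1] + (edges[1:] - edges[:-1]) * torch.rand(num_bins, device=x.device)
    # Map each element to the representative value of its bin.
    return reps[torch.bucketize(x, edges[1:-1])]
```

Applied independently per channel, e.g. `torch.stack([randomized_quantize(c) for c in img])` for an image tensor `img` of shape (3, H, W), this sketch composes with the usual sequential augmentations in a contrastive pipeline.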