Interpretable representation learning plays a key role in creative intelligent systems. In the music domain, current learning algorithms can successfully learn various features such as pitch, timbre, chord, and texture. However, most methods rely heavily on music domain knowledge. It remains an open question which general computational principles give rise to interpretable representations, especially low-dimensional factors that agree with human perception. In this study, we take inspiration from modern physics and use physical symmetry as a self-consistency constraint for the latent space. Specifically, this constraint requires the prior model that characterises the dynamics of the latent states to be equivariant with respect to certain group transformations. We show that physical symmetry leads the model to learn a linear pitch factor from unlabelled monophonic music audio in a self-supervised fashion. In addition, the same methodology can be applied to computer vision, learning a 3D Cartesian space from unlabelled videos of a simple moving object. Furthermore, physical symmetry naturally leads to representation augmentation, a new technique that improves sample efficiency.
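To make the equivariance requirement concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the group transformation is a translation of a one-dimensional latent, and it uses two hypothetical hand-written priors in place of a learned model. The function names (`translate`, `symmetry_loss`, `equivariant_prior`, `contracting_prior`) are illustrative assumptions. A prior is equivariant when transforming the latent state and then predicting gives the same result as predicting and then transforming.

```python
import numpy as np


def translate(z, shift):
    """Assumed group transformation S: translate the latent by `shift`."""
    return z + shift


def symmetry_loss(prior, z, shifts):
    """Mean squared gap between prior(S(z)) and S(prior(z)) over all shifts.

    A value of (numerically) zero means the prior commutes with S,
    i.e. it is equivariant under the assumed translations.
    """
    gaps = []
    for s in shifts:
        transformed_then_predicted = prior(translate(z, s))
        predicted_then_transformed = translate(prior(z), s)
        gaps.append(np.mean((transformed_then_predicted - predicted_then_transformed) ** 2))
    return float(np.mean(gaps))


def equivariant_prior(z):
    """Hypothetical prior: constant-velocity dynamics, z_{t+1} = z_t + 0.3."""
    return z + 0.3


def contracting_prior(z):
    """Hypothetical prior that breaks the symmetry: z_{t+1} = 0.5 * z_t + 0.3."""
    return 0.5 * z + 0.3


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 1))          # a batch of 1-D latent states
shifts = np.array([-1.0, 0.5, 2.0])  # sampled translations (arbitrary choice)

loss_equivariant = symmetry_loss(equivariant_prior, z, shifts)
loss_contracting = symmetry_loss(contracting_prior, z, shifts)
print("equivariant prior loss:", loss_equivariant)
print("contracting prior loss:", loss_contracting)
```

In a training setting, a term of this kind would be added to the objective so that the encoder and prior are pushed toward a latent space in which the dynamics commute with the chosen transformations. The constant-velocity prior passes the check, while the contracting prior does not, because scaling the latent does not commute with translating it.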