We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. Inspired by recent successes in modeling hierarchical relationships in text and images with hyperbolic embeddings, our algorithm obtains a hyperbolic embedding for each time-frequency bin of a mixture signal and estimates masks using hyperbolic softmax layers. On a synthetic dataset containing mixtures of multiple people talking and musical instruments playing, our hyperbolic model performed comparably to a Euclidean baseline in terms of source-to-distortion ratio, with stronger performance at low embedding dimensions. Furthermore, we find that time-frequency regions containing multiple overlapping sources are embedded toward the center (i.e., the most uncertain region) of the hyperbolic space, and we can use this certainty estimate to efficiently trade off between artifact introduction and interference reduction when isolating individual sounds.
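To make the certainty estimate concrete, the following is a minimal sketch and not the implementation described here. It assumes the hyperbolic space is a Poincaré ball of curvature -c (the abstract does not name the model), uses the geodesic distance of each time-frequency embedding from the origin as the certainty score, and applies a hard threshold to an already estimated mask. The function names, the `threshold` parameter, and the hard-gating rule are all hypothetical choices made for illustration.

```python
import numpy as np


def poincare_distance_to_origin(z, c=1.0, eps=1e-7):
    """Geodesic distance from the origin in a Poincare ball of curvature -c.

    z has shape (..., D). Assumption: embeddings live in a Poincare ball;
    the abstract only says "hyperbolic manifold".
    """
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(z, axis=-1)
    # Clip so arctanh stays finite for points numerically on the boundary.
    arg = np.clip(sqrt_c * norm, 0.0, 1.0 - eps)
    return 2.0 / sqrt_c * np.arctanh(arg)


def certainty_gated_mask(mask, embeddings, threshold, c=1.0):
    """Zero out mask entries whose embedding lies close to the origin.

    mask:       (T, F) mask for one source
    embeddings: (T, F, D) hyperbolic embedding per time-frequency bin
    threshold:  hypothetical certainty cutoff (distance from the origin)
    """
    certainty = poincare_distance_to_origin(embeddings, c=c)
    return np.where(certainty >= threshold, mask, 0.0)
```

Under these assumptions, raising `threshold` discards more of the uncertain (overlapping) bins, which should reduce interference at the cost of removing more of the target and so introducing more artifacts; lowering it does the opposite.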