Time-frequency (TF) representations in audio synthesis are increasingly modeled with real-valued networks. However, overlooking the complex-valued nature of TF representations can result in suboptimal performance and require additional modules (e.g., for modeling the phase). To this end, we introduce complex-valued polynomial networks, called APOLLO, that integrate such complex-valued representations naturally. Concretely, APOLLO captures high-order correlations of the input elements using high-order tensors as scaling parameters. By leveraging standard tensor decompositions, we derive different architectures and enable modeling richer correlations. We outline these architectures and showcase their performance in audio generation across four benchmarks. As a highlight, APOLLO yields a $17.5\%$ improvement over adversarial methods and an $8.2\%$ improvement over state-of-the-art diffusion models for audio generation on the SC09 dataset. Our models can encourage the systematic design of other efficient architectures over the complex field.
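To make the idea of a complex-valued polynomial expansion with decomposed high-order tensors concrete, the following is a minimal sketch and not the APOLLO architecture itself. It assumes a generic CP-style recursion, $x_1 = U_1 z$, $x_n = (U_n z) \odot x_{n-1} + x_{n-1}$, output $= C x_N$, in which every factor matrix is complex. The names `poly_forward`, `factors`, `C`, and all sizes are hypothetical and chosen only for illustration.

```python
import cmath
import random


def matvec(W, v):
    """Multiply a complex matrix (list of rows) by a complex vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_complex_matrix(rows, cols, rng, scale=0.5):
    """Matrix with independent Gaussian real and imaginary parts."""
    return [
        [complex(rng.gauss(0, scale), rng.gauss(0, scale)) for _ in range(cols)]
        for _ in range(rows)
    ]


def poly_forward(z, factors, C):
    """Degree-len(factors) polynomial of z (assumed CP-style recursion).

    x_1 = U_1 z
    x_n = (U_n z) * x_{n-1} + x_{n-1}   (elementwise product)
    out = C x_N
    The factor matrices stand in for a decomposed high-order weight
    tensor, so the full tensor is never materialised.
    """
    x = matvec(factors[0], z)
    for U in factors[1:]:
        gate = matvec(U, z)
        # Each step raises the polynomial degree in z by one.
        x = [g * xi + xi for g, xi in zip(gate, x)]
    return matvec(C, x)


rng = random.Random(0)
d_in, rank, d_out, degree = 4, 8, 2, 3  # illustrative sizes only
factors = [random_complex_matrix(rank, d_in, rng) for _ in range(degree)]
C = random_complex_matrix(d_out, rank, rng)

# Toy complex input: unit magnitudes with increasing phase, standing in
# for a few TF bins (not real audio data).
z = [cmath.rect(1.0, 0.3 * k) for k in range(d_in)]
out = poly_forward(z, factors, C)
print(out)
```

Because the inputs and weights are all complex, magnitude and phase are processed jointly rather than by a separate phase module. In this sketch the number of factor matrices sets the polynomial degree, which is one way to read "high-order correlations" in the abstract.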