Music mixing involves combining individual tracks into a cohesive mix, a task characterized by subjectivity: multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this multiplicity of solutions. Here we introduce MEGAMI (Multitrack Embedding Generative Auto MIxing), a generative framework that models the conditional distribution of professional mixes given unprocessed tracks. MEGAMI uses a track-agnostic effects processor conditioned on per-track generated embeddings, handles arbitrary unlabeled tracks through a permutation-equivariant architecture, and enables training on both dry and wet recordings via domain adaptation. Our objective evaluation using distributional metrics shows consistent improvements over existing methods, while listening tests indicate performance approaching human-level quality across diverse musical genres.