Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences because the loss functions they optimize are only loose proxies for human musical judgment. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs, including MusicRL's large-scale preference learning, multi-preference alignment frameworks such as the diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques such as Text2midi-InferAlign, we discuss how these methods can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and the reliability of preference modelling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.
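As a concrete illustration of what "preference alignment" means in this setting, the minimal sketch below shows a DPO-style pairwise objective of the kind underlying diffusion- and LLM-based alignment methods such as those cited above; it is not taken from any of the referenced systems, and the function and variable names (e.g. `dpo_loss`, `policy_logprob_win`) are hypothetical placeholders for the sequence log-likelihoods a music generator would produce.

```python
# Minimal sketch (assumption, not the paper's method) of a DPO-style
# preference-alignment objective for a music generator.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logprob_win, policy_logprob_lose,
             ref_logprob_win, ref_logprob_lose, beta=0.1):
    """DPO loss for one preference pair of generated music sequences.

    Each argument is the total log-probability of the preferred ("win") or
    dispreferred ("lose") sequence under the trainable policy or the frozen
    reference model; beta controls how far the policy may drift from the
    reference.
    """
    # Log-ratio of policy vs. reference for each sequence in the pair.
    win_ratio = policy_logprob_win - ref_logprob_win
    lose_ratio = policy_logprob_lose - ref_logprob_lose
    # Push the preferred sequence's ratio above the dispreferred one's.
    return -F.logsigmoid(beta * (win_ratio - lose_ratio)).mean()

# Toy usage with scalar log-probabilities for a single preference pair.
loss = dpo_loss(torch.tensor(-42.0), torch.tensor(-45.0),
                torch.tensor(-43.0), torch.tensor(-44.5))
print(float(loss))
```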