Music generation models can produce high-fidelity, coherent accompaniment given complete audio input, but they are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate, in real time, a coherent accompanying stream (e.g., a guitar accompaniment). In this work, we propose a model design that accounts for the system delays inevitable in practical deployment, parameterized by two design variables: future visibility $t_f$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_f, k)$ and show two consistent trade-offs: increasing the effective $t_f$ improves coherence by reducing the recency gap, but requires faster inference to stay within the latency budget; increasing $k$ improves throughput but degrades the accompaniment due to the reduced update rate. Finally, we observe that naive maximum-likelihood streaming training is insufficient for coherent accompaniment when future context is unavailable, motivating more advanced anticipatory and agentic objectives for live jamming.
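To make the roles of the two design variables concrete, the following is a minimal sketch of a chunked streaming loop under an assumed interface (`mic.read_until`, `model.generate`, `speaker.enqueue`) and an assumed frame rate; it illustrates the setting only and is not the paper's implementation.

```python
# Minimal sketch of the chunked real-time accompaniment loop. All names,
# the frame duration, and the timing convention are illustrative
# assumptions, not the paper's code or interface.
import time

FRAME_SEC = 0.02  # assumed duration of one audio frame


def stream_accompaniment(model, mic, speaker, t_f: float, k: int):
    """Emit k output frames per call; each call conditions on input heard
    up to an offset of t_f from the chunk's playback time (the exact sign
    convention is a system design choice, so it is left to the caller)."""
    chunk_sec = k * FRAME_SEC
    next_deadline = time.monotonic() + chunk_sec  # playback start of the next chunk
    while mic.is_live():
        # Conditioning context: everything heard up to the latest input
        # time allowed by the future-visibility offset t_f.
        context = mic.read_until(next_deadline + t_f)
        chunk = model.generate(context, num_frames=k)  # one call -> k frames
        # Real-time requirement: the chunk must exist before its playback
        # slot. Waiting for more recent input (larger effective t_f) and
        # slower inference both eat into this budget; a larger k gives more
        # budget per call but refreshes the conditioning less often.
        assert time.monotonic() <= next_deadline, "missed the playback deadline"
        speaker.enqueue(chunk, play_at=next_deadline)
        next_deadline += chunk_sec
```

In this framing, raising the effective $t_f$ means waiting longer for input before generating, which leaves less wall-clock time before the chunk's deadline, while raising $k$ lengthens the budget per call at the cost of a lower conditioning update rate.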