Streaming Generation for Music Accompaniment

Main: 4 pages, Bibliography: 2 pages, Appendix: 3 pages; 8 figures, 1 table
Abstract

Music generation models can produce high-fidelity, coherent accompaniment given complete audio input, but they are limited to editing and loop-based workflows. We study real-time audio-to-audio accompaniment: as a model hears an input audio stream (e.g., a singer singing), it must simultaneously generate, in real time, a coherent accompanying stream (e.g., a guitar accompaniment). In this work, we propose a model design that accounts for the system delays inevitable in practical deployment, with two design variables: future visibility $t_f$, the offset between the output playback time and the latest input time used for conditioning, and output chunk duration $k$, the number of frames emitted per call. We train Transformer decoders across a grid of $(t_f, k)$ values and show two consistent trade-offs: increasing the effective $t_f$ improves coherence by reducing the recency gap, but requires faster inference to stay within the latency budget; increasing $k$ improves throughput but degrades accompaniment quality due to a reduced update rate. Finally, we observe that naive maximum-likelihood streaming training is insufficient for coherent accompaniment when future context is not available, motivating advanced anticipatory and agentic objectives for live jamming.
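To make the $(t_f, k)$ trade-off concrete, the following is a minimal, illustrative sketch of a chunked streaming loop under the timing constraint described above. It is not the paper's implementation; the interfaces `mic.read_until`, `speaker.play_at`, and `model.generate` are hypothetical names assumed for illustration, and $t_f$ is taken as the signed offset (latest conditioning input time minus output playback time), typically negative in real time.

```python
# Illustrative sketch only (not the paper's implementation). `mic`, `speaker`,
# and `model` are hypothetical objects assumed for this example.
import time


def latency_budget(t_f: float) -> float:
    """Wall-clock seconds available to generate one output chunk.

    The chunk played at time t_p is conditioned on input up to t_p + t_f.
    In real time, t_f is typically negative (the model only hears input older
    than what it is currently playing against). The conditioning input is
    complete at wall time t_p + t_f and the chunk must be ready by t_p,
    so each model call has at most -t_f seconds.
    """
    return -t_f


def stream(model, mic, speaker, t_f: float, k: int, frame_dur: float) -> None:
    """Chunked streaming loop: hear input frames, emit k output frames per call."""
    context = []                      # all input frames heard so far
    t_p = 0.0                         # playback time of the next output chunk
    budget = latency_budget(t_f)

    while True:
        # Block until every input frame up to time t_p + t_f has arrived,
        # then append those frames to the conditioning context.
        context.extend(mic.read_until(t_p + t_f))

        start = time.monotonic()
        chunk = model.generate(context, num_frames=k)   # k new output frames
        elapsed = time.monotonic() - start

        if elapsed > budget:
            # Inference is too slow for this (t_f, k): the chunk misses its
            # playback deadline. Raising t_f toward zero shrinks the budget
            # (smaller recency gap, faster inference required); raising k
            # improves per-call throughput but lowers the update rate.
            raise RuntimeError(f"missed deadline: {elapsed:.3f}s > {budget:.3f}s")

        speaker.play_at(chunk, t_p)   # schedule the chunk for playback at t_p
        t_p += k * frame_dur          # advance playback time by one chunk
```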
