Zachary Novack#♭*¶
Zach Evans♭*
Zack Zukowski♭
Josiah Taylor♭
CJ Carr♭
Julian Parker♭
Adnan Al-Sinan♮
Gian Marco Iodice♮
Julian McAuley#
Taylor Berg-Kirkpatrick#
Jordi Pons♭
#University of California, San Diego
♭Stability AI
♮Arm
*Shared Contribution
¶Work done while an intern at Stability AI
Text-to-audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and in ≈7s on a mobile edge device, making it, to our knowledge, the fastest text-to-audio model.
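To make the two ingredients concrete, here is a minimal PyTorch sketch of the loss shapes involved: a relativistic adversarial objective that compares paired real/fake discriminator scores, and a contrastive discriminator term that compares matched vs. shuffled prompt pairings. The score names (`d_real`, `d_fake`, `d_matched`, `d_shuffled`) and function signatures are illustrative assumptions, not our exact implementation.

```python
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # Discriminator: score real audio above the paired one-step fake
    # generated from the same prompt (relativistic pairing).
    return -F.logsigmoid(d_real - d_fake).mean()

def relativistic_g_loss(d_real, d_fake):
    # Generator: the reversed objective, pushing fake scores above
    # the paired real scores.
    return -F.logsigmoid(d_fake - d_real).mean()

def contrastive_d_loss(d_matched, d_shuffled):
    # Contrastive term: score audio paired with its own prompt above
    # the same audio paired with a shuffled prompt, pushing the
    # discriminator (and hence the generator) toward prompt adherence.
    return -F.logsigmoid(d_matched - d_shuffled).mean()
```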
@article{Novack2025Fast,
title={Fast Text-to-Audio Generation with Adversarial Post-Training},
author={Zachary Novack and Zach Evans and Zack Zukowski
and Josiah Taylor and CJ Carr and Julian Parker
and Adnan Al-Sinan and Gian Marco Iodice
and Julian McAuley and Taylor Berg-Kirkpatrick and Jordi Pons},
year={2025},
journal={arXiv:}
}
This section presents audio samples generated by four different methods: SAO, the pre-trained RF model, RF+Presto distillation, and RF+ARC post-training. For each prompt, we display 5 random generations per method.
This section showcases style transfer examples. We can perform training-free style transfer by interpolating a reference sample back to an intermediate noise level and then generating with our ARC model using a *different* prompt. Varying the noise level controls how closely the output aligns with the reference audio. For each reference generation (seed), we show two different style transfer results.
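A minimal sketch of this procedure, assuming a flow model trained under the convention z_t = t·z_ref + (1−t)·noise (t=1 is clean data, t=0 is pure noise) and a hypothetical `model.sample_from(z_t, t, prompt)` sampler entry point:

```python
import torch

@torch.no_grad()
def style_transfer(model, z_ref, new_prompt, t=0.5):
    # Interpolate the reference latent back to an intermediate noise
    # level t: t=1 keeps the reference untouched, t=0 discards it.
    noise = torch.randn_like(z_ref)
    z_t = t * z_ref + (1.0 - t) * noise
    # Resume sampling from time t with a *different* prompt; larger t
    # stays closer to the reference audio, smaller t gives the new
    # prompt more freedom.
    return model.sample_from(z_t, t, prompt=new_prompt)
```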
The style transfer results presented above can be chained together to create long-form compositions: by applying our zero-shot style transfer technique with ARC sequentially, each output becomes the reference for the next, yielding extended audio pieces. The example below demonstrates how this chaining process lets users produce coherent, high-quality, loop-driven human-AI co-compositions. Note that *every* sound heard here is generated with our ARC model, with post-processing in Ableton Live.
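The chaining itself is just a loop in which each output is fed back as the next reference. Using the hypothetical `style_transfer` sketch above, plus an equally hypothetical `model.sample(prompt)` call for the initial generation:

```python
prompts = ["ambient pad loop", "glitchy percussion loop", "warm bass loop"]
z = model.sample(prompts[0])  # initial reference generation
segments = [z]
for prompt in prompts[1:]:
    # Each new segment is a style transfer of the previous one.
    z = style_transfer(model, z, prompt, t=0.6)
    segments.append(z)
```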