Fast Text-to-Audio Generation with Adversarial Post-Training

Zachary Novack#♭*¶   Zach Evans♭*   Zack Zukowski   Josiah Taylor   CJ Carr   Julian Parker  
Adnan Al-Sinan   Gian Marco Iodice   Julian McAuley#   Taylor Berg-Kirkpatrick#   Jordi Pons  

#University of California, San Diego
♭Stability AI
Arm
*Shared Contribution
¶Work done while an intern at Stability AI

Paper · 🤗 HF Weights · 🤗 HF Paper · Code



Abstract

Text-to-audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to match their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100 and in ≈7s on a mobile edge-device, making it the fastest text-to-audio model to our knowledge.
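
To make the two objectives concrete, below is a minimal, hypothetical sketch of the loss terms named above: a relativistic adversarial loss computed on paired real/generated scores, and a contrastive discriminator term that contrasts matched and mismatched audio-caption pairs. The softplus form, the function names, and the discriminator interface are illustrative assumptions, not the released training code.

import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # Discriminator: push scores for real audio above scores for generated audio.
    return F.softplus(d_fake - d_real).mean()

def relativistic_g_loss(d_real, d_fake):
    # Generator: the mirrored objective.
    return F.softplus(d_real - d_fake).mean()

def contrastive_d_loss(d_matched, d_mismatched):
    # Contrastive term: score audio paired with its own caption above the same
    # audio paired with a shuffled (mismatched) caption, rewarding prompt adherence.
    return F.softplus(d_mismatched - d_matched).mean()

# Mismatched pairs can be built by shuffling captions within a batch, e.g.:
#   d_matched    = discriminator(audio, caption_emb)
#   d_mismatched = discriminator(audio, caption_emb[torch.randperm(len(audio))])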


Bibtex
@article{Novack2025Fast,
  title={Fast Text-to-Audio Generation with Adversarial Post-Training}, 
  author={Zachary Novack and Zach Evans and Zack Zukowski
        and Josiah Taylor and CJ Carr and Julian Parker
        and Adnan Al-Sinan and Gian Marco Iodice 
        and Julian McAuley and Taylor Berg-Kirkpatrick and Jordi Pons},
  year={2025},
  journal={arXiv:}
}

Random Generations

This section presents audio samples generated by four different methods (SAO, the pre-trained RF model, RF+Presto distillation, and RF+ARC post-training). For each prompt, we display 5 random generations per method.

Prompt: Latin Funk Drumset 115 BPM Stereo
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]
Prompt: Sports Car Passing By
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]
Prompt: Fire Crackling
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]
Prompt: 80 BPM Industrial Ambient Loop
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]
Prompt: Helicopter Circling Around in Stereo
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]
Prompt: Water
[5 audio samples each from SAO, Pre-Trained RF, Presto, and ARC (ours)]

Audio-to-Audio Style Transfer

This section showcases style transfer examples. We can perform training-free style transfer by partially re-noising a reference sample to a chosen noise level and then generating from it with our ARC model using a *different* prompt. Varying the noise level controls how closely the output aligns with the reference audio; a minimal sketch of the procedure is given below. For each reference generation (seed), we show two different style transfer results.
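
The following is a minimal, hypothetical sketch of this procedure, assuming generation happens in a latent space under a rectified-flow noising convention (t = 0 is clean data, t = 1 is pure noise); encode, decode, and arc_generate are placeholder names standing in for the autoencoder and the ARC generator, not the released API.

import torch

def style_transfer(reference_audio, new_prompt, noise_level,
                   encode, decode, arc_generate):
    # noise_level in [0, 1]: higher values weaken alignment with the reference.
    x0 = encode(reference_audio)                  # latent of the reference audio
    noise = torch.randn_like(x0)
    # Interpolate the reference latent back toward noise.
    x_t = (1.0 - noise_level) * x0 + noise_level * noise
    # Regenerate from the partially noised latent with a *different* prompt.
    x_new = arc_generate(x_t, prompt=new_prompt, start_t=noise_level)
    return decode(x_new)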

Style Transfer Example 1
[Reference audio and two style transfer results]
Style Transfer Example 2
[Reference audio and two style transfer results]
Style Transfer Example 3
[Reference audio and two style transfer results]

Chaining Audio-to-Audio for Full Loop-Driven Compositions

The style transfer results presented above can be chained together to create long-form compositions. This approach generates extended audio pieces by sequentially applying our zero-shot style transfer technique with ARC, where each output becomes the reference for the next generation. The example below demonstrates how this chaining process lets users produce coherent, high-quality loop-driven human-AI co-compositions. Note that *every* sound heard here was generated with our ARC model; only the post-processing was done in Ableton Live.
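
As an illustration of the chaining idea, the loop below reuses the hypothetical style_transfer sketch from the style transfer section; the prompt list and the noise level of 0.7 are arbitrary values chosen for the example, not settings from the paper.

# seed_audio, encode, decode, and arc_generate are the same placeholders as above.
prompts = ["Latin Funk Drumset 115 BPM Stereo",
           "80 BPM Industrial Ambient Loop",
           "Helicopter Circling Around in Stereo"]

reference = seed_audio        # an initial ARC generation
loops = [reference]
for prompt in prompts:
    # Each new loop is a style transfer of the previous output onto a new prompt.
    reference = style_transfer(reference, prompt, noise_level=0.7,
                               encode=encode, decode=decode,
                               arc_generate=arc_generate)
    loops.append(reference)
# loops now holds a sequence of related clips that can be arranged in a DAW.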