Zachary Novack#♭*¶
Zach Evans♭*
Zack Zukowski♭
Josiah Taylor♭
CJ Carr♭
Julian Parker♭
Adnan Al-Sinan♮
Gian Marco Iodice♮
Julian McAuley#
Taylor Berg-Kirkpatrick#
Jordi Pons♭
#University of California, San Diego
♭Stability AI
♮Arm
*Shared Contribution
¶Work done while an intern at Stability AI
Text-to-audio systems, while increasingly performant, are slow at inference time, making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and in ≈7s on a mobile edge device, making it, to our knowledge, the fastest text-to-audio model.
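To make the two ingredients concrete, here is a minimal PyTorch sketch of the loss shapes involved: a relativistic adversarial objective that compares paired real/fake discriminator scores, and a contrastive discriminator term that compares matched vs. shuffled prompt pairings. The score names (`d_real`, `d_fake`, `d_matched`, `d_shuffled`) and function signatures are illustrative assumptions, not our exact implementation.

```python
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # Discriminator: score real audio above the paired one-step fake
    # generated from the same prompt (relativistic pairing).
    return -F.logsigmoid(d_real - d_fake).mean()

def relativistic_g_loss(d_real, d_fake):
    # Generator: the reversed objective, pushing fake scores above
    # the paired real scores.
    return -F.logsigmoid(d_fake - d_real).mean()

def contrastive_d_loss(d_matched, d_shuffled):
    # Contrastive term: score audio paired with its own prompt above
    # the same audio paired with a shuffled prompt, pushing the
    # discriminator (and hence the generator) toward prompt adherence.
    return -F.logsigmoid(d_matched - d_shuffled).mean()
```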
@article{Novack2025Fast,
title={Fast Text-to-Audio Generation with Adversarial Post-Training},
author={Zachary Novack and Zach Evans and Zack Zukowski
and Josiah Taylor and CJ Carr and Julian Parker
and Adnan Al-Sinan and Gian Marco Iodice
and Julian McAuley and Taylor Berg-Kirkpatrick and Jordi Pons},
year={2025},
journal={arXiv:}
}
This section presents audio samples generated by four different methods: SAO, the pre-trained RF model, RF+Presto distillation, and RF+ARC post-training. For each prompt, we display 5 random generations per method.
This section showcases style transfer examples. We can perform training-free style transfer by interpolating a reference sample back to an intermediate noise level and then generating with our ARC model using a *different* prompt. Varying the noise level controls how closely the output aligns with the reference audio. For each reference generation (seed), we show two different style transfer results.
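A minimal sketch of this procedure, assuming a flow model trained under the convention z_t = t·z_ref + (1−t)·noise (t=1 is clean data, t=0 is pure noise) and a hypothetical `model.sample_from(z_t, t, prompt)` sampler entry point:

```python
import torch

@torch.no_grad()
def style_transfer(model, z_ref, new_prompt, t=0.5):
    # Interpolate the reference latent back to an intermediate noise
    # level t: t=1 keeps the reference untouched, t=0 discards it.
    noise = torch.randn_like(z_ref)
    z_t = t * z_ref + (1.0 - t) * noise
    # Resume sampling from time t with a *different* prompt; larger t
    # stays closer to the reference audio, smaller t gives the new
    # prompt more freedom.
    return model.sample_from(z_t, t, prompt=new_prompt)
```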
The style transfer results presented above can be chained together to create long-form compositions: by applying our zero-shot style transfer technique with ARC sequentially, each output becomes the reference for the next, yielding extended audio pieces. The example below demonstrates how this chaining process lets users produce coherent, high-quality, loop-driven human-AI co-compositions. Note that *every* sound heard here is generated with our ARC model, with post-processing in Ableton Live.
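The chaining itself is just a loop in which each output is fed back as the next reference. Using the hypothetical `style_transfer` sketch above, plus an equally hypothetical `model.sample(prompt)` call for the initial generation:

```python
prompts = ["ambient pad loop", "glitchy percussion loop", "warm bass loop"]
z = model.sample(prompts[0])  # initial reference generation
segments = [z]
for prompt in prompts[1:]:
    # Each new segment is a style transfer of the previous one.
    z = style_transfer(model, z, prompt, t=0.6)
    segments.append(z)
```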