AudioLM is a model that generates audio in the waveform domain.
It uses two tokenizers: SoundStream to compute the acoustic tokens,
and w2v-BERT to compute the semantic tokens.
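A minimal sketch of the two token streams. The function names, token rates, and placeholder ids below are illustrative assumptions, not the actual AudioLM API; they only show that both tokenizers map a waveform to discrete sequences at a much lower rate than the audio itself.

```python
import numpy as np

def semantic_tokens(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for w2v-BERT + clustering: coarse, low-rate token ids."""
    n = len(waveform) // 640          # assumed 25 Hz token rate at 16 kHz
    return np.zeros(n, dtype=np.int64)  # placeholder ids

def acoustic_tokens(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for SoundStream RVQ: S frames x Q codebook ids."""
    n = len(waveform) // 320          # 50 Hz frame rate (x320 reduction)
    return np.zeros((n, 8), dtype=np.int64)  # placeholder ids, Q = 8 assumed
```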
SoundStream [2] is a SOTA neural audio codec. The model has three parts:
- Encoder
- Residual Vector Quantizer (RVQ)
- Decoder
The convolutional encoder/decoder takes a single-channel waveform:
- Input: waveform at 16 kHz.
- Encoder embeddings: 50 Hz ($\times 320$ reduction).
- Codebook symbols: $Y \in \{1, \dots, N\}^{T_A \times Q}$, where $T_A = T/320$.
- Encoded audio: $enc(x) \in \mathbb{R}^{S \times D}$.
- One-hot encoded vectors of shape $S \times D$.
- Decoder embeddings.
- Reconstructed waveform.
Parameters:
- Encoder conv. blocks: $B_{enc} = 4$
- Decoder conv. blocks: $B_{dec} = 4$
- Channels: $C_{enc} = C_{dec}$
- Input samples per embedding: $M = 2 \cdot 4 \cdot 5 \cdot 8 = 320$ (the four encoder strides: 2, 4, 5, 8)
- Embedding dimensionality: $D$
- Number of samples in the time domain: $T$
- Number of frames of the encoded audio: $S = T / M$
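The stride arithmetic above can be checked directly: the four encoder strides multiply to $M$, so one embedding covers $M$ samples and the frame rate drops from 16 kHz to 50 Hz.

```python
# The four encoder strides multiply to M = 320 samples per embedding.
strides = (2, 4, 5, 8)
M = 1
for s in strides:
    M *= s

fs = 16_000      # input sample rate (Hz)
T = fs * 1       # samples in a 1-second clip
S = T // M       # encoded frames: S = T / M
frame_rate = fs // M  # embedding rate in Hz
print(M, frame_rate, S)  # prints "320 50 50"
```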
The adversarial loss is computed with an STFT-based discriminator. The input to the STFT discriminator is the complex-valued STFT of the input waveform (real and imaginary parts), and the output is the logits.
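A sketch of how that discriminator input can be formed: the complex STFT split into a real and an imaginary channel. The window and hop sizes here are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from scipy.signal import stft

x = np.random.randn(16000)                    # 1 s of 16 kHz audio
# Zxx is the complex-valued STFT, shape (freq_bins, time_frames)
_, _, Zxx = stft(x, fs=16000, nperseg=1024, noverlap=768)
# stack real and imaginary parts as two input channels for the discriminator
disc_input = np.stack([Zxx.real, Zxx.imag])   # shape (2, freq_bins, time_frames)
```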
w2v-BERT [3] is a Transformer-based model for learning self-supervised audio representations. It maps an input waveform to a set of linguistic features.
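AudioLM discretizes these features by clustering them with k-means and using the centroid index of each frame as its semantic token. The sketch below shows that assignment step; the random feature vectors, feature dimension, and codebook size are stand-ins for real w2v-BERT activations and a trained k-means codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))   # (frames, feature_dim), stand-in activations
centroids = rng.normal(size=(512, 16))  # k-means codebook, K = 512 (illustrative)

# nearest-centroid assignment -> one semantic token id per frame
d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
tokens = d2.argmin(axis=1)              # shape (100,), values in [0, 512)
```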
During training, the number of quantizers is sampled at random, $n_q \sim U[1, N_q]$, and only the first quantizers $Q_i$, $i = 1 \dots n_q$, are used. At inference, set $n_q$ to select the desired bitrate.
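A sketch of residual vector quantization with a variable number of quantizers: each stage quantizes the residual left by the previous one, so keeping only the first $n_q$ stages trades quality for bitrate. The codebooks are random stand-ins for trained ones, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_q, N, D = 8, 64, 16                       # max quantizers, codebook size, embedding dim
codebooks = rng.normal(size=(N_q, N, D))    # stand-in for trained RVQ codebooks

def rvq_encode(frame: np.ndarray, n_q: int) -> list[int]:
    """Encode one D-dim frame with the first n_q residual quantizers."""
    residual = frame.copy()
    codes = []
    for q in range(n_q):                    # only the first n_q stages are used
        d2 = ((codebooks[q] - residual) ** 2).sum(-1)
        idx = int(d2.argmin())              # nearest codeword in stage q
        codes.append(idx)
        residual = residual - codebooks[q][idx]  # quantize what is left over
    return codes

# during training n_q would be sampled in [1, N_q]; fixed here for illustration
codes = rvq_encode(rng.normal(size=D), n_q=4)
```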
[1] AudioLM: a Language Modeling Approach to Audio Generation
[2] SoundStream: An End-to-End Neural Audio Codec
[3] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training