How to Use Transformers for Audio in AI

To adapt the transformer architecture for audio applications, we can use the conventional transformer structure outlined above, with a minor adjustment on the input or output side to accommodate audio data instead of text. Because these models all share the transformer architecture, their core components remain similar; the primary differences lie in the training methods and in how the input or output data is processed.

Audio Model Inputs

From the above discussion, we can conclude that the input to an audio model can be either text or audio.

Text Input

If the input is text, the original transformer architecture works as it is. The input text is tokenized and passed through an embedding layer to obtain embedding vectors. These embedding vectors are then passed into the transformer encoder.
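
As a rough illustration of this path, the sketch below uses a toy whitespace tokenizer and a randomly initialised embedding table. The vocabulary, `embed_dim`, and the helper names `tokenize` and `embed` are assumptions made for this example; real models use learned subword tokenizers and trained embeddings.

```python
import random

# Toy vocabulary (hypothetical); real models use learned subword vocabularies.
vocab = {"<unk>": 0, "hello": 1, "audio": 2, "world": 3}
embed_dim = 4  # illustrative size; production models use hundreds of dimensions

# Embedding table: one vector per token id, randomly initialised here.
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in vocab]


def tokenize(text):
    """Map whitespace-separated words to token ids, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [embedding_table[i] for i in token_ids]


token_ids = tokenize("Hello audio transformers")
vectors = embed(token_ids)
print(token_ids)                      # one id per word; unknown words map to 0
print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
```

The resulting list of vectors plays the role of the embedding sequence that the transformer encoder consumes.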

Audio Input

If the input is audio, we need to convert it into embedding vectors. To do this, we use a stack of 1D convolutions known as the CNN feature encoder.

There are two approaches as to how we can pass the audio input through the CNN encoder:

  • Feeding the raw audio (waveform) directly into the CNN feature encoder.
  • Converting the raw audio into a log-Mel spectrogram or MFCCs and then feeding that into the CNN feature encoder.

Waveform Audio Input

A waveform represents an audio signal (or any other continuous signal) as a sequence of discrete data points. It is a one-dimensional sequence of floating-point numbers, each of which corresponds to the sampled amplitude of the signal at a particular point in time. These amplitudes typically reflect voltage or pressure variations in the analog signal.

When we speak, we disturb the air, causing alternating compression and decompression. This disturbance travels through the air as a sound wave. Speech contains many frequencies at once, all within the range of human hearing, roughly 20 Hz to 20,000 Hz. A microphone captures these changes in air pressure and converts them into an electrical signal. This signal is continuous, so to store it we must convert it into a digital signal. The conversion is achieved through sampling: we take the value of the signal at fixed time intervals and store it. With a sampling rate of 16 kHz, a 1-second audio clip is stored as 16,000 sample values.

Sound Wave Sampling
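
To make the sampling step concrete, here is a minimal sketch that samples a pure tone at 16 kHz for one second. The 440 Hz tone is an assumption chosen purely for illustration; a real recording would contain many frequencies at once.

```python
import math

sample_rate = 16_000   # samples per second (16 kHz)
duration_s = 1.0       # length of the clip in seconds
frequency_hz = 440.0   # pitch of the illustrative tone (an assumption)

num_samples = int(sample_rate * duration_s)

# Sample the continuous signal sin(2*pi*f*t) at the instants t = n / sample_rate.
waveform = [
    math.sin(2 * math.pi * frequency_hz * n / sample_rate)
    for n in range(num_samples)
]

print(num_samples)   # 16000 values for one second of audio
print(waveform[:3])  # first few amplitudes, floats between -1 and 1
```

The list `waveform` is exactly the kind of one-dimensional sequence of floating-point amplitudes described above.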

The above raw input is passed through a feature encoder. A typical feature encoder, introduced in the Wav2Vec2 model and now commonly used, contains seven blocks; the temporal convolutions in each block have 512 channels, with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2). This results in an encoder output frequency of 49 Hz, with a stride of about 20 ms between output frames and a receptive field of 400 input samples, or 25 ms of audio. The detailed calculation is shown below:


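Block   Input (channels x length)   Kernel width   Stride   Output (channels x length)
1       1 x 16000                   10             5        512 x 3199
2       512 x 3199                  3              2        512 x 1599
3       512 x 1599                  3              2        512 x 799
4       512 x 799                   3              2        512 x 399
5       512 x 399                   3              2        512 x 199
6       512 x 199                   2              2        512 x 99
7       512 x 99                    2              2        512 x 49

Each output length is floor((input length - kernel width) / stride) + 1, because the convolutions use no padding.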

Total stride = 5 * 2 * 2 * 2 * 2 * 2 * 2 = 320 samples (roughly 16000 / 49)

At 16 kHz, 16,000 samples span 1 second, so 320 samples span 20 ms.
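
The arithmetic in this section can be reproduced with a short script. This is a sketch that assumes unpadded convolutions, so each output length is floor((length - kernel) / stride) + 1; the helper name `conv_output_length` is made up for this example.

```python
kernel_widths = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
sample_rate = 16_000


def conv_output_length(length, kernel, stride):
    """Output length of a 1D convolution without padding (an assumption)."""
    return (length - kernel) // stride + 1


# Track the sequence length through the seven blocks for one second of audio.
length = sample_rate
lengths = [length]
for kernel, stride in zip(kernel_widths, strides):
    length = conv_output_length(length, kernel, stride)
    lengths.append(length)

# Total stride is the product of the strides. The receptive field grows by
# (kernel - 1) input steps per layer, scaled by the stride accumulated so far.
total_stride = 1
receptive_field = 1
for kernel, stride in zip(kernel_widths, strides):
    receptive_field += (kernel - 1) * total_stride
    total_stride *= stride

print(lengths)  # [16000, 3199, 1599, 799, 399, 199, 99, 49]
print(total_stride, 1000 * total_stride / sample_rate)        # 320 samples, 20.0 ms
print(receptive_field, 1000 * receptive_field / sample_rate)  # 400 samples, 25.0 ms
```

One second of audio therefore becomes 49 embedding vectors, each summarising 25 ms of signal and spaced 20 ms apart.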

A typical process of feeding a raw audio input to Transformer Encoder

Spectrogram Inputs

A drawback of using the raw waveform as input is its long sequence length. To illustrate, thirty seconds of audio recorded at a 16 kHz sampling rate gives an input length of 30 * 16000 = 480,000. Longer sequences increase the computational demands of the transformer model and lead to high memory usage. Because of this, raw audio waveforms are not usually the most efficient way of representing an audio input. A spectrogram carries much of the same information in a more compact form. Two popular options are to convert the raw input into a Mel spectrogram or into MFCCs. Let's discuss both:

  • Mel Spectrogram
    • In most real-world audio signals, the frequency content varies over time. This is where the spectrogram comes into the picture. We take a sound signal and divide it into small windows, then compute the Fourier transform of each window. Plotting the spectra obtained for all the windows side by side gives what is known as a spectrogram.
    • The human ear has limitations in distinguishing closely spaced frequencies, and this limitation becomes more noticeable as the frequencies get higher. For instance, we can easily tell the difference between 500 Hz and 1000 Hz, but we struggle to differentiate between 15000 Hz and 15500 Hz, even though the numerical difference is the same. In 1937, Stevens, Volkmann, and Newman introduced the Mel scale, a unit of pitch in which equal pitch intervals sound equally spaced to a listener. A commonly used formula for converting a frequency f in Hz to mels is m = 2595 * log10(1 + f / 700).
    • We also don’t hear loudness on a linear scale. As a common rule of thumb, a sound needs roughly ten times as much energy (about 10 dB) to be perceived as twice as loud. We therefore take the logarithm of the amplitude, which gives the decibel (dB) scale used to express sound levels.
    • A log-Mel spectrogram is a spectrogram in which the frequencies are converted to the Mel scale and the amplitudes to decibels.
    • Models such as Whisper, for example, divide the audio into 30-second segments, and for each segment the Mel spectrogram has a shape of (80, 3000), with 80 being the number of Mel bins and 3000 the sequence length. By transforming the audio into a log-Mel spectrogram, we not only reduce the volume of input data but also notably shorten the sequence length compared to the raw waveform. The log-Mel spectrogram is then passed through a CNN feature encoder, as shown previously, resulting in a sequence of embeddings that can be fed into the transformer in the usual manner.
  • Mel-frequency Cepstral Coefficients (MFCC)
    • Another popular way of representing raw audio is to convert it to Mel-frequency cepstral coefficients (MFCCs), a set of coefficients that capture the spectral characteristics of an audio signal. MFCCs are a very compact representation, often using just 13 or 20 coefficients.
    • MFCCs are computed as follows:
      • Frame the signal: we split the signal into short frames of 20 to 40 ms to get a reliable spectral estimate. A frame length of 25 ms with a hop length of 10 ms is standard, meaning that every 10 ms we take 25 ms of audio, so consecutive frames overlap. For a 16 kHz signal, each frame contains 400 samples (16000 * 0.025) and consecutive frames start 160 samples apart (16000 * 0.010).
      • Compute the power spectrum: we take the Discrete Fourier Transform (DFT) of each frame, then take the absolute value of the complex result and square it. This gives the periodogram estimate of the power spectrum. Typically, a 512-point FFT is performed and only the first 257 coefficients are retained.
      • Apply a Mel-spaced filterbank: this comprises 20 to 40 triangular filters (26 being a common choice) applied to the power spectrum from the previous step. Each filter is a vector of length 257 (given the FFT settings above) that is zero everywhere except over one band of frequencies. To compute the filterbank energies, we multiply each filter by the power spectrum and sum the result, yielding 26 values that describe how the energy is distributed across the filters.
      • Take the logarithm of each of the 26 energies, resulting in 26 log filterbank energies.
      • Apply the Discrete Cosine Transform (DCT) to the 26 log filterbank energies, yielding 26 cepstral coefficients.
      • For speech-related tasks, it is common to retain only the lower 12 to 13 of the 26 coefficients.
      • The resulting feature set, consisting of 12 to 13 numbers for each frame, is referred to as the Mel-frequency cepstral coefficients (MFCCs).
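
The MFCC steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not a reference one: the Hamming window, the mel formula m = 2595 * log10(1 + f / 700), the small constant added before the logarithm, the unnormalised DCT-II, and the choice to keep 13 coefficients are all choices made for this example. The intermediate `log_energies` array is a small log-Mel representation (26 filters here rather than 80 bins).

```python
import numpy as np

sample_rate = 16_000
frame_len = 400   # 25 ms at 16 kHz
hop_len = 160     # 10 ms at 16 kHz
n_fft = 512       # FFT size; yields 257 non-redundant coefficients
n_filters = 26    # number of triangular mel filters
n_mfcc = 13       # cepstral coefficients kept per frame


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale: (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):    # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):   # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank


def mfcc(signal):
    # 1. Frame the signal: 400-sample frames starting every 160 samples.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack(
        [signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]
    )
    frames = frames * np.hamming(frame_len)  # windowing (an assumption, common practice)

    # 2. Power spectrum (periodogram) of each frame: (n_frames, 257).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

    # 3. Mel filterbank energies: (n_frames, 26).
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T

    # 4. Log of the energies; the small constant avoids log(0).
    log_energies = np.log(energies + 1e-10)

    # 5. DCT-II of the log energies: (n_frames, 26).
    n = np.arange(n_filters)
    k = np.arange(n_filters)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    cepstra = log_energies @ dct_basis.T

    # 6. Keep only the lowest coefficients.
    return cepstra[:, :n_mfcc]


# One second of a 440 Hz tone as a stand-in for real audio.
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(signal)
print(features.shape)  # (98, 13): 98 frames, 13 coefficients each
```

In practice, an audio library would normally be used for this; the point of writing it out is to show how each step changes the shape of the data.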

We can pass the Mel spectrogram or the MFCCs to a CNN feature encoder, which computes the embedding vectors that are then consumed by the transformer encoder. The 1D convolution is performed along the time (sequence length) axis, and the Mel bins (80) or MFCC coefficients are treated as the input channels of the convolution.
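
A minimal NumPy sketch of this idea is shown below: a single 1D convolution slides along the time axis and treats the 80 Mel bins as input channels. The kernel width of 3, stride of 2, padding of 1, the small `d_model`, and the random weights are assumptions for illustration; a real feature encoder stacks several trained layers with non-linearities.

```python
import numpy as np


def conv1d(x, weight, bias, stride=1, padding=0):
    """1D convolution over time.

    x:      (in_channels, time)
    weight: (out_channels, in_channels, kernel_width)
    bias:   (out_channels,)
    returns (out_channels, out_time)
    """
    x = np.pad(x, ((0, 0), (padding, padding)))  # zero-pad the time axis only
    kernel_width = weight.shape[2]
    # windows has shape (in_channels, num_windows, kernel_width)
    windows = np.lib.stride_tricks.sliding_window_view(x, kernel_width, axis=1)
    windows = windows[:, ::stride, :]
    # Sum over input channels and kernel positions for every output channel.
    return np.einsum("oik,itk->ot", weight, windows) + bias[:, None]


rng = np.random.default_rng(0)
n_mels, n_frames, d_model = 80, 3000, 16  # d_model kept small for the demo

log_mel = rng.standard_normal((n_mels, n_frames))      # stand-in spectrogram
weight = rng.standard_normal((d_model, n_mels, 3)) * 0.01
bias = np.zeros(d_model)

features = conv1d(log_mel, weight, bias, stride=2, padding=1)
embeddings = features.T  # (time, d_model): one embedding vector per time step
print(embeddings.shape)  # (1500, 16)
```

With a stride of 2, the convolution also halves the sequence length, which further reduces the cost of the attention layers that follow.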

In both scenarios, whether dealing with waveform or spectrogram inputs, a common approach involves employing a compact neural network prior to utilizing the Transformer architecture. This initial network is responsible for transforming the input data into embeddings, after which the Transformer takes over to perform its specific task. This process ensures that relevant information is efficiently extracted from the input data before the Transformer works its magic.

Audio Transformer Outputs

The transformer architecture outputs a sequence of hidden-state vectors, also known as the output embeddings. Our goal is to transform these vectors into a text or audio output.

Text Output

The objective of an automatic speech recognition model is to predict a sequence of text tokens. To accomplish this, a language modeling head, typically a single linear layer, is added on top of the transformer’s output. A softmax is then applied to the output of this head, yielding a probability for each token in the vocabulary.
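
The sketch below shows such a head in NumPy: one linear layer followed by a softmax, then a greedy pick of the most likely token at each position. The toy sizes and random weights are assumptions for illustration; in a real model the weights are trained and decoding is often more elaborate than a per-position argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 5, 8, 10  # toy sizes for the demo

hidden_states = rng.standard_normal((seq_len, d_model))  # transformer output
W = rng.standard_normal((d_model, vocab_size)) * 0.1     # LM head weights
b = np.zeros(vocab_size)                                 # LM head bias


def softmax(logits):
    """Row-wise softmax; subtracting the max keeps the exponentials stable."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


logits = hidden_states @ W + b          # (seq_len, vocab_size)
probs = softmax(logits)                 # each row sums to 1
predicted_ids = probs.argmax(axis=-1)   # most likely token id per position

print(probs.shape, predicted_ids.shape)
```

Each row of `probs` is a probability distribution over the vocabulary for one position in the output sequence.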

Spectrogram Output

For models tasked with generating audio, such as a text-to-speech (TTS) model, additional layers must be incorporated to facilitate audio sequence generation. A common approach involves generating a spectrogram initially and subsequently utilizing an additional neural network, referred to as a vocoder, to transform this spectrogram into a waveform.

The output of the transformer network is a sequence of vectors, each of dimension d (typically 768). A linear layer projects this sequence into a log-Mel spectrogram. Following this, a post-net, composed of additional linear and convolutional layers, refines the spectrogram by reducing noise. Finally, the vocoder takes this refined spectrogram and generates the final audio waveform.
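
As a rough sketch of the first two stages, the code below projects hidden states to 80 Mel bins and then applies a toy post-net. The post-net is modelled here as a single convolution along time whose output is added back as a residual correction, which is an assumption borrowed from Tacotron 2-style designs rather than something the text specifies. All weights are random, the sizes other than d = 768 are illustrative, and the vocoder step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, d_model, n_mels = 50, 768, 80

hidden_states = rng.standard_normal((n_frames, d_model))  # transformer output

# Linear layer: project each d_model-dimensional vector to one spectrogram frame.
W_proj = rng.standard_normal((d_model, n_mels)) * 0.02
b_proj = np.zeros(n_mels)
mel_before = hidden_states @ W_proj + b_proj  # (n_frames, n_mels)


def conv1d_same(x, weight):
    """Convolve along time with zero padding so the length is preserved.

    x:      (time, in_channels)
    weight: (kernel_width, in_channels, out_channels), kernel_width must be odd
    """
    kernel_width = weight.shape[0]
    pad = kernel_width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return sum(padded[k : k + len(x)] @ weight[k] for k in range(kernel_width))


# Toy post-net: one convolution that predicts a correction to the spectrogram.
W_post = rng.standard_normal((5, n_mels, n_mels)) * 0.01
mel_after = mel_before + conv1d_same(mel_before, W_post)

print(mel_before.shape, mel_after.shape)  # both (50, 80)
```

A vocoder would then take `mel_after` as input and produce the waveform samples.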

Generating Output through Decoder

Audio Transformer

The transformer architecture has the ability to process all the parts of input in parallel through its self-attention mechanism without the need to sequentially process them.

Types of Audio Architectures

Audio architectures can be broadly classified into two types:


