Gabor Filterbank (GFB) Features
GFB is a recent feature set designed for robust ASR that explicitly captures the spectro-temporal modulation frequencies of the signal.
To derive GFB, we compute the log mel-spectrum from an input signal.
The spectrum is filtered by a Gabor filterbank which consists of 41 carefully designed Gabor filters.
Representative channels of each filtered spectrum are selected and concatenated to form the 311-dimensional GFB feature vector.
Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition
Gabor Filter Bank (GBFB) Features
Calculation of the GBFB features
An overview of the feature extraction scheme with the Gabor filter bank process is illustrated in Fig.
Fig. Illustration of the Gabor filter bank feature extraction. The input log Mel-spectrogram is filtered with each of the 41 filters of the Gabor filter bank. An example filter output is shown. The representative channels of this filter output are selected and concatenated with the representative channels of the other 40 Gabor filters. The resulting 311-dimensional output is used as the feature vector.
First, a Mel-spectrogram is calculated from the speech signal using an implementation of the ETSI standard. This standard defines a Mel-spectrogram with 23 frequency channels whose center frequencies range from 124 Hz to 3657 Hz. The calculation is based on frames of 25 ms length at a temporal resolution of 100 frames/s. The absolute output values of the spectrogram are compressed with the logarithm, roughly resembling the amplitude compression performed by the auditory system.
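The spectrogram step can be sketched as follows. This is a minimal NumPy illustration, not the ETSI reference implementation: the 8 kHz sampling rate, the 256-point FFT, the Hamming window, the triangular filter shape, and the 64 Hz to 4000 Hz band edges are all assumptions, chosen so that the 23 channel centers land close to the 124 Hz and 3657 Hz quoted above. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(fs=8000, n_fft=256, n_mels=23, f_min=64.0, f_max=4000.0):
    """Triangular filters equally spaced on the mel scale (assumed layout)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    weights = np.zeros((n_mels, bin_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        weights[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return weights, edges[1:-1]  # filter weights and center frequencies in Hz

def log_mel_spectrogram(signal, fs=8000, frame_ms=25.0, hop_ms=10.0,
                        n_fft=256, n_mels=23, floor=1e-10):
    """Log mel-spectrogram of shape (channels, frames).

    Assumes the signal is at least one frame long.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))  # 25 ms frames
    hop = int(round(fs * hop_ms / 1000.0))          # 10 ms hop, i.e. 100 frames/s
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    weights, _ = mel_filterbank(fs, n_fft, n_mels)
    mel = magnitude @ weights.T
    # Logarithmic compression; the floor avoids log(0) on silent frames.
    return np.log(np.maximum(mel, floor)).T

# Usage: one second of noise as a stand-in for speech.
spec = log_mel_spectrogram(np.random.default_rng(0).standard_normal(8000))
```

The output is arranged as channels by frames so that the first axis matches the channel variable \(k\) and the second the time-frame variable \(n\) used in the filter definition.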
Then the spectrogram is processed with the filters from the GBFB, by calculating the two-dimensional convolution of the spectrogram and the filter. This results in a time-frequency representation that contains patterns matching the modulation frequencies associated with a specific filter.
The filtering process is illustrated in Fig., which shows the original spectrogram, a sample filter, and the filter output.
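The filtering step is a plain two-dimensional convolution, which can be sketched with NumPy alone. The text does not say how the spectrogram borders are handled; this sketch assumes zero-padding and crops the result to the spectrogram's shape. The name `conv2d_same` is hypothetical.

```python
import numpy as np

def conv2d_same(spec, filt):
    """2D convolution of spec with filt, cropped to the shape of spec.

    Borders are treated as zero-padded (an assumption of this sketch).
    """
    n_chan, n_frames = spec.shape
    fk, fn = filt.shape
    full = np.zeros((n_chan + fk - 1, n_frames + fn - 1),
                    dtype=np.result_type(spec, filt))
    # Accumulate one shifted, weighted copy of the spectrogram per filter tap.
    for i in range(fk):
        for j in range(fn):
            full[i:i + n_chan, j:j + n_frames] += filt[i, j] * spec
    # Crop so that the filter's center sample aligns with each output position.
    k0, n0 = (fk - 1) // 2, (fn - 1) // 2
    return full[k0:k0 + n_chan, n0:n0 + n_frames]

# Usage: a 23-channel spectrogram stand-in and a small averaging filter.
spec = np.random.default_rng(0).standard_normal((23, 50))
smoothed = conv2d_same(spec, np.full((3, 5), 1.0 / 15.0))
```

The tap-by-tap loop keeps the example dependency-free; for the largest filters an FFT-based convolution would be faster but gives the same result.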
Gabor filter bank
The localized complex Gabor filters are defined in Eq., with the channel and time-frame variables \(k\) and \(n\);
\(k_{0}\) denoting the central frequency channel; \(n_{0}\) the central time frame;
\(\omega_{k}\) the spectral modulation frequency; \(\omega_{n}\) the temporal modulation frequency;
\(v_{k}\) the number of semi-cycles under the envelope in spectral dimension; \(v_{n}\) the number of semi-cycles under the envelope in temporal dimension;
and \(\phi\) an additional global phase.
For purely temporal and purely spectral modulation filters (\(\omega_k=0\) or \(\omega_n=0\), respectively) this definition results in filter functions with infinite support. For that reason, the size of all filters is limited to 69 channels and 40 time frames. These limits correspond roughly to the maximum size of the spectro-temporal filters in the respective dimensions.
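A filter of this kind can be sketched as a complex carrier multiplied by a localized envelope. Several choices below are assumptions, since the defining equation is not reproduced here: the Hann-shaped envelope, modulation frequencies given in radians per sample (so that \(v\) semi-cycles span a width of \(v\pi/\omega\)), clipping of the envelope width to the size limit, and the illustrative value 3.5 for \(v_k\) and \(v_n\). The function name `gabor_filter` is hypothetical, and the offsets are measured from the filter center, i.e. they stand for \(k-k_0\) and \(n-n_0\).

```python
import numpy as np

def gabor_filter(omega_k, omega_n, nu_k=3.5, nu_n=3.5, phi=0.0,
                 max_size=(69, 40)):
    """Complex 2D Gabor filter and its real envelope (assumed Hann envelope)."""

    def axis(omega, nu, limit):
        # nu semi-cycles of length pi/|omega| each; infinite when omega == 0,
        # so the width is clipped to the size limit.
        width = limit if omega == 0 else min(nu * np.pi / abs(omega), limit)
        half = int(np.ceil(width / 2.0)) - 1   # largest offset inside the window
        x = np.arange(-half, half + 1)         # odd support with a center sample
        envelope = 0.5 + 0.5 * np.cos(2.0 * np.pi * x / width)
        return x, envelope

    k, env_k = axis(omega_k, nu_k, max_size[0])
    n, env_n = axis(omega_n, nu_n, max_size[1])
    envelope = np.outer(env_k, env_n)
    carrier = np.exp(1j * (omega_k * k[:, None] + omega_n * n[None, :] + phi))
    return envelope * carrier, envelope

# Usage: one spectro-temporal filter and one purely temporal filter.
g_st, env_st = gabor_filter(0.5, 1.0)
g_t, env_t = gabor_filter(0.0, 1.0)  # omega_k = 0: spectral extent hits the limit
```

Because the sketch insists on an odd support with a center sample, a width limit of 40 frames yields at most 39 samples in time; an implementation that needs exactly 40 would have to handle the even case differently.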
Due to the linear relation between the modulation frequency and the extension of the envelope, all filters with identical values for \(v_{k}\) and \(v_{n}\) are constant-Q filters.
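The constant-Q property can be made explicit with a short derivation for the spectral dimension. It assumes that \(\omega_k\) is measured in radians per channel, so that one semi-cycle spans \(\pi/\omega_k\) channels; the symbols \(W_k\) (envelope width), \(\Delta\omega_k\) (bandwidth) and \(Q_k\) are introduced here for illustration, and the proportionality constant depends on the envelope shape.

\[
W_k = \frac{v_k\,\pi}{\omega_k}, \qquad
\Delta\omega_k \propto \frac{1}{W_k} = \frac{\omega_k}{v_k\,\pi}, \qquad
Q_k = \frac{\omega_k}{\Delta\omega_k} \propto v_k\,\pi .
\]

The quality factor therefore depends only on \(v_k\) and not on \(\omega_k\); the same argument applies to \(v_n\) and \(\omega_n\) in the temporal dimension.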
Since relative energy fluctuations are of special interest for the classification of speech, the DC bias of each filter is removed. This is achieved by subtracting a normalized version of the filter's envelope function from the filter function, so that their DC values cancel each other out.
The effect of the DC removal is that the resulting representation is independent of the global signal energy. Since removing the mean on a logarithmic energy scale is the same as dividing by it on a linear scale, this corresponds to a normalization.
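The DC removal and its effect can be checked numerically. In this sketch the example filter (a Hann envelope times a complex carrier with arbitrary modulation frequencies) and the name `remove_dc` are assumptions for illustration; only the subtraction of the scaled envelope follows the description above.

```python
import numpy as np

def remove_dc(filt, envelope):
    """Subtract the envelope, scaled so that its sum equals the filter's sum."""
    return filt - envelope * (filt.sum() / envelope.sum())

# Hypothetical example filter: Hann envelope times a complex carrier.
k = np.arange(-6, 7)[:, None]
n = np.arange(-10, 11)[None, :]
envelope = ((0.5 + 0.5 * np.cos(2.0 * np.pi * k / 14.0))
            * (0.5 + 0.5 * np.cos(2.0 * np.pi * n / 22.0)))
filt = envelope * np.exp(1j * (0.5 * k + 0.3 * n))
dc_free = remove_dc(filt, envelope)

# A constant offset on the log scale is a constant gain on the linear scale.
patch = np.random.default_rng(0).standard_normal(filt.shape)  # log-spectrogram stand-in
offset = 3.0
response = np.sum(dc_free * patch)
response_shifted = np.sum(dc_free * (patch + offset))
# Both responses agree up to rounding, because the DC-free filter sums to zero.
```

Since the filter coefficients sum to zero, any constant added to the log spectrogram contributes nothing to the filter response, which is the gain independence described above.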
DC-free Gabor filters thus normalize the representation both spectrally and temporally.