2020, Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework
DOI:10.1016/j.dsp.2020.102943
paper
multi-spectrogram:
- log-Mel spectrogram (log-Mel)
STFT spectra \(S(f,t)=\sum_{n=0}^{N-1}x_{t}[n]w[n]e^{-i2\pi n f/f_{s}}\)
Mel frequency warping \(f_{mel}=2595\log_{10}(1+f/700)\)
simulates the overall frequency selectivity of the human auditory system
- Gammatonegram (Gamma)
STFT spectra
Gammatone weighting by \(g(t)=t^{P-1}e^{-2\pi bt}\cos(2\pi ft+\theta)\)
models the frequency-selective cochlea activation response of the human inner ear
- Constant Q Transform (CQT)
models the geometric relationship of pitch, which makes it likely to be effective for comparing natural and artificial sounds and suitable for frequencies that span several octaves
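
To make the formulas in these notes concrete, here is a minimal Python sketch using only the standard library. It is not the paper's implementation. All function names are hypothetical, the default filter order `order=4` and `bins_per_octave=12` are assumptions chosen for illustration, and the CQT part only shows the geometric spacing of centre frequencies (the usual CQT convention, assumed here), not a full transform.

```python
import cmath
import math


def stft_frame(frame, window, f, fs):
    """One STFT value S(f, t) for a single frame x_t[n] and window w[n]."""
    return sum(
        x * w * cmath.exp(-2j * math.pi * n * f / fs)
        for n, (x, w) in enumerate(zip(frame, window))
    )


def hz_to_mel(f_hz):
    """Mel warping from the notes: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)


def gammatone_ir(t, f, b, order=4, theta=0.0):
    """Gammatone impulse response g(t) = t^(P-1) e^(-2 pi b t) cos(2 pi f t + theta).

    `order` plays the role of P; the default of 4 is an assumption.
    """
    envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
    return envelope * math.cos(2.0 * math.pi * f * t + theta)


def cqt_center_frequencies(f_min, n_bins, bins_per_octave=12):
    """Geometrically spaced centre frequencies f_k = f_min * 2^(k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
```

With these assumed defaults, `cqt_center_frequencies(55.0, 25)` spans two octaves with a constant ratio between neighbouring bins, which is the "geometric relationship of pitch" mentioned above, while `hz_to_mel` compresses high frequencies in the way the Mel warping formula describes.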