The Conventional CNN-based Method
To be specific, given an audio clip, a two-dimensional time-frequency representation (e.g., the log-Mel spectrogram) is first extracted. Convolutional layers are then applied to the time-frequency representation \(\boldsymbol{M} \in \mathbb{R}^{T \times F}\) to obtain the deep representation
\[
\boldsymbol{M}^{'} = f_{cnn}(\boldsymbol{M}; \theta_{cnn}) \in \mathbb{R}^{c \times t \times f},
\]
where \(c\) denotes the number of output channels, and \(t\) and \(f\) denote the time and frequency dimensions of the deep representation. Here, \(f_{cnn}\) denotes the operation of the convolutional layers and \(\theta_{cnn}\) denotes the model parameters of the convolutional layers. A global pooling layer and fully-connected layers are then applied to obtain the predicted classification scores.
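As a concrete illustration, the convolutional front end \(f_{cnn}\) can be sketched in NumPy with random weights. The kernel size, channel count, and input dimensions below are assumptions chosen for demonstration, not values from the text; a minimal single-layer "valid" convolution with a ReLU stands in for the full stack of convolutional layers:

```python
import numpy as np

def conv_layer(M, kernels):
    """Valid 2D convolution of M (T, F) with kernels (c, kh, kw) -> (c, t, f)."""
    c, kh, kw = kernels.shape
    T, F = M.shape
    t, f = T - kh + 1, F - kw + 1          # 'valid' output size
    out = np.empty((c, t, f))
    for ch in range(c):
        for i in range(t):
            for j in range(f):
                out[ch, i, j] = np.sum(M[i:i + kh, j:j + kw] * kernels[ch])
    return np.maximum(out, 0.0)            # ReLU non-linearity

rng = np.random.default_rng(0)
M = rng.standard_normal((100, 64))         # stand-in log-Mel: T = 100, F = 64
kernels = rng.standard_normal((8, 3, 3))   # c = 8 output channels (assumed)
M_deep = conv_layer(M, kernels)            # deep representation M'
print(M_deep.shape)                        # (8, 98, 62)
```

In practice the loops would be replaced by an optimized convolution (e.g., a deep-learning framework), but the shape bookkeeping \((T, F) \to (c, t, f)\) is the same.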
Let \(f_{gp}\) and \(f_{fc}\) be the operations of the global pooling layer and the fully-connected layers, respectively. The predicted score \(\hat{\boldsymbol{y}} \in \mathbb{R}^{N}\) (where \(N\) denotes the number of categories) can be obtained by
\[
\hat{\boldsymbol{y}} = f_{fc}(f_{gp}(\boldsymbol{M}^{'}); \theta_{fc}),
\]
where \(\theta_{fc}\) denotes the model parameters of the fully-connected layers.
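The classification head \(f_{fc}(f_{gp}(\cdot))\) can likewise be sketched with random weights. Global average pooling and a single softmax-terminated fully-connected layer are assumed here for concreteness, as is the category count \(N = 10\):

```python
import numpy as np

def global_pool(M_deep):
    """Global average pooling f_gp: (c, t, f) -> (c,)."""
    return M_deep.mean(axis=(1, 2))

def fully_connected(h, W, b):
    """f_fc with a softmax output: (c,) -> predicted scores y_hat in R^N."""
    z = W @ h + b
    z = z - z.max()                        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
M_deep = rng.standard_normal((8, 98, 62))  # stand-in deep representation M'
N = 10                                     # number of categories (assumed)
W = rng.standard_normal((N, 8))            # FC weights theta_fc (random)
b = np.zeros(N)
y_hat = fully_connected(global_pool(M_deep), W, b)
print(y_hat.shape)                         # (10,)
```

The softmax scores sum to one, so \(\hat{\boldsymbol{y}}\) can be read directly as per-category probabilities.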