The Conventional CNN-based Method

To be specific, given an audio clip, a two-dimensional time-frequency representation (e.g., a log-Mel spectrogram) is first extracted. Convolutional layers are then applied to the time-frequency representation \(\boldsymbol{M} \in \mathbb{R}^{T \times F}\), where \(T\) is the number of time frames and \(F\) is the number of frequency bins, to obtain the deep representation \(\boldsymbol{M}^{'} \in \mathbb{R}^{c \times t \times f}\), where \(c\) denotes the number of output channels and \(t\) and \(f\) are the (typically downsampled) time and frequency dimensions.

\[\boldsymbol{M}^{'}=f_{cnn}(\boldsymbol{M};\theta_{cnn}) \]

Here, \(f_{cnn}\) denotes the operation of the convolutional layers and \(\theta_{cnn}\) denotes their parameters.
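As a concrete illustration, a minimal PyTorch-style sketch of this front end is given below. The log-Mel extraction via torchaudio and the specific layer widths, kernel sizes, and pooling choices are illustrative assumptions, not a prescribed architecture.

```python
import torch
import torch.nn as nn
import torchaudio

# Log-Mel extraction: waveform -> Mel power spectrogram -> log compression.
# Sample rate, FFT size, hop length and n_mels are assumed values for illustration.
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=512, n_mels=64
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, num_samples) -> M: (batch, T, F)
    mel = mel_extractor(waveform)                   # (batch, F, T)
    return torch.log(mel + 1e-6).transpose(1, 2)    # (batch, T, F)

class ConvBackbone(nn.Module):
    """f_cnn: maps M (T x F) to the deep representation M' (c x t x f)."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                        # halves both t and f
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = mel.unsqueeze(1)                        # (batch, 1, T, F)
        return self.layers(x)                       # (batch, c, t, f)
```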
A global pooling layer and fully-connected layers are then applied to obtain the predicted classification scores. Let \(f_{gp}\) and \(f_{fc}\) denote the operations of the global pooling layer and the fully-connected layers, respectively. The predicted score vector \(\hat{\boldsymbol{y}}\in\mathbb{R}^{N}\), where \(N\) denotes the number of categories, is obtained by

\[\hat{\boldsymbol{y}}=f_{fc}(f_{gp}(\boldsymbol{M}^{'});\theta_{fc}) \]

where \(\theta_{fc}\) denotes the model parameters of the fully-connected layers.
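A matching sketch of the pooling and classification stage is shown below; the hidden layer size and `num_classes` (standing in for \(N\)) are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """f_fc(f_gp(.)): global pooling over (t, f) followed by fully-connected layers."""
    def __init__(self, in_channels: int = 64, num_classes: int = 10):
        super().__init__()
        self.global_pool = nn.AdaptiveAvgPool2d(1)  # f_gp: (batch, c, t, f) -> (batch, c, 1, 1)
        self.fc = nn.Sequential(                    # f_fc with parameters theta_fc
            nn.Linear(in_channels, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, deep_rep: torch.Tensor) -> torch.Tensor:
        pooled = self.global_pool(deep_rep).flatten(1)  # (batch, c)
        return self.fc(pooled)                          # (batch, N) predicted scores

# Example with a dummy deep representation M' of shape (batch, c, t, f):
deep_rep = torch.randn(4, 64, 32, 16)
scores = ClassifierHead(in_channels=64, num_classes=10)(deep_rep)  # (4, 10)
```

Global average pooling is used for \(f_{gp}\) in this sketch; max pooling or attention-based pooling are common alternatives in the same pipeline.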
