2014, Robust CNN-based Speech Recognition With Gabor Filter Kernels
Here we report on the integration of predefined Gabor filters with a trained convolutional neural network (CNN) to generate a more robust feature, referred to as GCNN. A typical CNN architecture filters receptive fields with shared weights, modeling local characteristics of the spectrum. This filtering process allows 2D Gabor filters to be incorporated into the CNN topology. We modify the CNN's receptive fields to use several time and frequency supports matching the Gabor filter characteristics. The modified CNN takes Gabor filter coefficients as the initial filters of the lowest layer and fine-tunes them via back-propagation training. In experiments, the proposed GCNN features outperform both Gabor-DNN features, which keep the Gabor coefficients untrained, and CNN features, which use trained filters without Gabor modeling. In addition, a pooling algorithm effectively reduces the error rate for noisy speech recognition.
Proposed Method
Power-Normalized Spectrum
Both the Gabor filters and the CNN filters implement localized spectro-temporal filtering, and therefore take some function of the short-term power spectrum as input. Although the mel spectrum has been used successfully in many experiments, it is easily corrupted in the presence of noise. In this paper, features are generated from a more robust spectro-temporal representation, called the power-normalized spectrum (PNS), using the PNCC-based feature generation algorithm [16]. Unlike the mel spectrum, the short-term spectrum is integrated using gammatone auditory filters equally spaced on the equivalent rectangular bandwidth (ERB) scale. Next, a medium-duration power bias is subtracted, where the bias level is computed from the ratio of the arithmetic mean to the geometric mean (the AM/GM ratio) of the medium-duration power, since reducing the AM/GM ratio suppresses the noise power [16]. Finally, a power-law nonlinearity with exponent 0.1 replaces the logarithmic nonlinearity used for compression in mel cepstra.
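The PNS pipeline above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the ERB formula is the standard Glasberg-Moore one, and the bias term here is a crude stand-in for the full AM/GM-ratio-based bias search of PNCC [16]. The 0.1 bias fraction is an assumption for illustration only.

```python
import numpy as np

def erb_center_freqs(n_filters=40, f_min=100.0, f_max=8000.0):
    """Center frequencies equally spaced on the ERB scale (Glasberg & Moore)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    erb_inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return erb_inv(np.linspace(erb(f_min), erb(f_max), n_filters))

def power_normalized_spectrum(band_power, floor=1e-10):
    """Crude PNS sketch: subtract a power bias (a stand-in for the
    AM/GM-based level estimation of PNCC), then apply the 0.1 power law."""
    p = np.maximum(band_power, floor)
    # per-band bias: a fixed fraction of the mean power over time
    # (illustrative only; PNCC estimates this per medium-duration segment)
    bias = 0.1 * p.mean(axis=0, keepdims=True)
    q = np.maximum(p - bias, floor)
    return q ** 0.1  # power-law nonlinearity replaces the log compression

# toy usage: 100 frames x 40 gammatone bands of synthetic "power"
pns = power_normalized_spectrum(np.random.rand(100, 40) + 0.5)
print(pns.shape)  # (100, 40)
```

The key contrast with mel cepstra is in the last line: compression via an exponent of 0.1 rather than a logarithm, which is better behaved at very low band powers.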
Gabor features
Convolutional Neural Network
Convolutional Neural Network using Gabor filter
Experimental Setup
The proposed approach was evaluated using two noisy versions of WSJ:
(1) Aurora 4 and (2) RATS "re-noised" Wall Street Journal (WSJ) speech.
The Aurora 4 dataset provides both a clean training set and a multi-condition training set.
The clean training set is taken from 7138 utterances of the WSJ0 SI-84 dataset (83 speakers), recorded using a Sennheiser microphone.
The multi-condition training set contains the same number of utterances as the clean training set, with half of the utterances recorded by a secondary microphone.
Six noise types (car, babble, restaurant, street, airport and train) at SNRs between 10dB and 20dB were randomly added to three-fourths of utterances from both microphone types.
The evaluation set is based on 166 utterances of the Nov'92 5k evaluation set (8 speakers) and is composed of 14 subsets: a clean set and 6 noise-corrupted sets for data recorded by each of the two microphone types. The noise types are the same as those used for the multi-condition training set, but were added at SNRs between 5 and 15 dB.
The 14 subsets are grouped into 4 sets: clean, noisy, clean with microphone distortion, and noisy with microphone distortion, referred to as A, B, C, and D respectively.
For "RATS re-noised WSJ", we started out with data taken from WSJ1 dataset (284 speakers) for training and the WSJ-eval94 dataset (20 speakers) for testing.
Additive and channel noise estimated from degraded recordings was applied to both the training and testing datasets using the "renoiser" tool [24].
[24] “Renoiser web page,” http://labrosa.ee.columbia.edu/projects/renoiser/create_wsj.html
Designed for use in the DARPA RATS project, the system analyzes RATS rebroadcast example signals (in this case, LDC2011E20) to estimate the noise characteristics, including SNRs and frequency shifts; the original data is described in [25] and consists of a variety of continuous speech sources that were transmitted and received over 8 different radio channels, resulting in significant signal degradation. The 8 radio channel characteristics are specified in Table 1.
We applied the same noise characteristics to WSJ data to generate the "RATS re-noised WSJ". In this case, the training data was obtained from 51.2 hours of WSJ1 dataset with clean channel and channel G (the channel with highest SNR). Testing data was 0.8 hours of WSJ-eval94 for each channel. The results reported here are WERs averaging over clean and 8 noisy channels.
For both Aurora 4 and RATS re-noised WSJ, the acoustic models used cross-word triphones estimated with maximum likelihood. The triphone states were clustered into 2500 tied states, each modeled by a 16-component Gaussian mixture model. We used version 0.6 of the CMU pronunciation dictionary and the standard 5k bigram language model created at Lincoln Labs for the 1992 evaluation.
Unless otherwise specified, mean normalization was performed on the features, while vocal tract length normalization (VTLN) and adaptation techniques such as maximum likelihood linear regression (MLLR) were not employed in these tests.
The fully connected deep neural networks were trained with a 6-layer bottleneck structure, with the bottleneck (25 units) in the fifth hidden layer. The output layer consisted of 41 context-independent phonetic targets.
39-d cepstral coefficients or 814-d Gabor features, over 9 successive frames, were used as input to the fully connected deep neural network.
Restricted Boltzmann machine (RBM) pre-training was employed to initialize the parameters of the neural network. For the back propagation that followed pre-training, we began with a learning rate of 0.008 and halved the learning rate once cross-validation indicated limited progress at the current rate, continuing until cross-validation showed essentially no further progress.
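The learning-rate halving schedule described above can be sketched as a small control loop. The `min_gain` threshold and the example loss sequence below are illustrative assumptions, not values from the paper.

```python
def newbob_schedule(cv_losses, lr0=0.008, min_gain=0.01):
    """Sketch of the halving schedule: keep the current learning rate until
    the cross-validation loss stops improving by at least `min_gain`,
    then halve it for the next epoch."""
    lr, schedule, prev = lr0, [], float("inf")
    for loss in cv_losses:
        if prev - loss < min_gain:  # limited progress at this rate
            lr /= 2.0
        schedule.append(lr)
        prev = loss
    return schedule

# hypothetical cross-validation losses over three epochs
print(newbob_schedule([1.0, 0.9, 0.895]))  # [0.008, 0.008, 0.004]
```

In practice training stops when, even after halving, cross-validation shows essentially no further improvement.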
For the CNN topology, we used 120 filters for the convolutional layer. The filter size was 9 frequency bands with 15 successive frames.
We used a pooling size of 6 convolutional bands with a stride of 2 (an overlap of 4 bands), which reduced the dimensionality by roughly a factor of 2.
This layer was fed to a 5-layer fully connected bottleneck structure.
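The convolutional layer and pooling described above can be sketched in numpy. This is a minimal illustration of the shapes involved (120 filters of 9 bands by 15 frames, pooling size 6, stride 2), not the paper's implementation; the filter values here are random placeholders.

```python
import numpy as np

def conv_and_pool(spec, filters, pool_size=6, pool_stride=2):
    """Sketch of the convolutional layer: each filter (9 bands x 15 frames)
    slides along the frequency axis of the input patch, then the resulting
    convolutional bands are max-pooled (size 6, stride 2)."""
    n_bands, n_frames = spec.shape     # e.g. 40 PNS bands x 15 frames
    fb, ft = filters.shape[1:]         # filter: frequency x time support
    conv = np.array([
        [np.sum(f * spec[b:b + fb, :ft]) for b in range(n_bands - fb + 1)]
        for f in filters
    ])                                 # (n_filters, conv_bands)
    pooled = np.array([
        conv[:, s:s + pool_size].max(axis=1)
        for s in range(0, conv.shape[1] - pool_size + 1, pool_stride)
    ]).T                               # (n_filters, pooled_bands)
    return pooled

spec = np.random.rand(40, 15)          # 40-d PNS over a 15-frame context
filters = np.random.randn(120, 9, 15)  # 120 filters of 9 bands x 15 frames
print(conv_and_pool(spec, filters).shape)  # (120, 14)
```

With 40 input bands and 9-band filters there are 32 convolutional bands, and pooling with size 6 and stride 2 reduces these to 14, i.e. roughly the factor-of-2 reduction noted above.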
In the GCNN architecture, the time support for each filter kernel ranges from 7 to 99 frames, and frequency support ranges from 7 to 40 bands.
59 of the filters were initialized as Gabor filter coefficients, and the other 61 filters were randomly initialized.
The rest of the network setup is the same as for the CNN.
For both the CNN and GCNN architectures, the 40-d power-normalized spectrum was used as input. We did not use delta and acceleration coefficients, to be consistent with the Gabor filter input.
The back-propagation strategy is the same as that used for the DNN, but no pre-training was performed.
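The GCNN initialization (59 Gabor-initialized filters plus 61 randomly initialized ones) can be sketched as follows. The Hann envelope, the modulation-frequency ranges, and the single common 9x15 support are simplifying assumptions for illustration; the paper allows per-filter supports from 7x7 up to 40x99.

```python
import numpy as np

def gabor_kernel(freq_support, time_support, omega_f, omega_t):
    """Real part of a 2D Gabor filter: a sinusoidal carrier windowed by a
    Hann envelope. Parameters are illustrative, not the paper's filter bank."""
    f = np.arange(freq_support) - (freq_support - 1) / 2.0
    t = np.arange(time_support) - (time_support - 1) / 2.0
    env = np.outer(np.hanning(freq_support), np.hanning(time_support))
    carrier = np.cos(omega_f * f[:, None] + omega_t * t[None, :])
    return env * carrier

# GCNN-style initialization: 59 Gabor filters plus 61 random filters,
# here all forced to one common 9x15 support for simplicity.
rng = np.random.default_rng(0)
gabor_init = [gabor_kernel(9, 15, rng.uniform(0.2, 1.0), rng.uniform(0.2, 1.0))
              for _ in range(59)]
random_init = [rng.standard_normal((9, 15)) * 0.01 for _ in range(61)]
filters = np.stack(gabor_init + random_init)
print(filters.shape)  # (120, 9, 15)
```

A nonzero joint dependence on `omega_f` and `omega_t` is what produces the diagonal (spectro-temporally modulated) kernels that, per the discussion below Table 3, random initialization rarely discovers on its own.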
For a fair comparison, the number of free parameters in each neural network was constrained to roughly 3.5M by controlling the hidden layer size.
MFCCs were concatenated with the features trained by the fully connected deep neural network or the convolutional neural network, resulting in a 64-d feature vector.
Also, means and variances were normalized per utterance before HMM training and testing for all the features described here.
Results and Discussion
We first present a series of baseline results for RATS re-noised WSJ and Aurora 4 using the clean training set.
In Table 3, we compare a series of trained features using a fully connected neural network or a CNN.
First of all, the trained features of Table 3 are better than the untrained features of Table 2. Within Table 3, Gabor-DNN was better than PNCC-DNN except on the clean set (A).
Next, we compare Gabor-DNN with PNS-CNN and PNS-GCNN without a pooling layer. Without pooling, the inputs to the fully connected network are the feature maps of the convolutional layer, so these rows function as a comparison between different sets of spectro-temporal filters:
- Gabor-DNN uses filters of variable size, but they are entirely handcrafted.
- PNS-CNN, on the other hand, learns filters of a fixed size, trained on limited data.
- PNS-GCNN has filters of variable size, and the trained filters are initialized from the handcrafted filters.
In Table 3, PNS-GCNN was better than the other two features, although the differences are small.
The larger effects visible in the table are those of pooling, and the cumulative effects of pooling plus using GCNN instead of CNN. In particular, max pooling provides a significant improvement for both RATS WSJ and Aurora 4 (especially for the noisy set (B) and the noisy set with channel distortion (D)), and, particularly with pooling, using Gabor filters to help design the CNN has a clear benefit.
The trained CNN filters are composed of several vertical (spectral) and horizontal (temporal) filters, as the examples in Fig. 4 show. However, these filters have very low correlation with diagonal Gabor filters such as the one in Fig. 5 (left), whereas in the GCNN topology the diagonal filter is kept and tuned; the resulting final filter is shown in Fig. 5 (right). Thus, diagonal filtering is another factor distinguishing filters trained from Gabor initialization from those trained from random initialization.
In addition to the experiments with mismatched training and testing, we also used the multi-condition training set for Aurora 4. We chose the most distinctive features (Gabor-DNN, PNS-CNN, and PNS-GCNN) and compared them with two baselines, ETSI-AFE and PNCC.
The results are shown in Table 4, where the proposed PNS-GCNN achieved a 16.6% WER. This was achieved without VTLN, MLLR, or other modeling enhancements.