Class 6: DNN-HMM Based Speech Recognition Systems
GMM-HMM Speech Recognition System
Modeling and Training
For each speech utterance, we first extract features to obtain a feature sequence, which is then modeled with a GMM-HMM.
Each state is modeled by a GMM, and each word by an HMM. When an utterance comes in, the Viterbi algorithm computes the probability of the sequence under the GMM-HMM, and Viterbi backtracking then tells us which HMM state each frame belongs to (the alignment).
Decoding
For any feature sequence, the GMM-HMM model gives the observation probability of any state at any time; applying the Viterbi algorithm with the HMM transition probabilities and the observation sequence, we can decode.
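The Viterbi recursion described here can be sketched in a few lines. A minimal NumPy version, assuming log-domain observation likelihoods, a transition matrix, and initial-state probabilities as inputs (names and shapes are illustrative, not from any toolkit):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Viterbi decoding: most likely state sequence.
    log_obs:   (T, S) log observation likelihoods per frame and state
    log_trans: (S, S) log transition probabilities
    log_init:  (S,)   log initial-state probabilities
    """
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (S, S): prev -> cur
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_obs[t]
    # Backtrace the best path; this backtrace is exactly the alignment step.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The same routine serves both decoding (finding the best state sequence) and forced alignment (restricting the states to those of the known transcript).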
Question: since the GMM is modeling a probability density, it can equally be implemented with a DNN; it is essentially a classification model. Assuming the system has the feature vectors and three states shown in the figure above, we can build a neural network with three outputs, and the GMM is simply replaced by the DNN.
DNN-HMM Speech Recognition System
Modeling and Training
After feature extraction, the raw speech yields spectral features (MFCC, typically 13-dimensional, or Fbank). A DNN then acts as a classifier over HMM states on top of these features. For any training utterance we have the audio and its transcript label; after forced alignment we obtain the utterance's state sequence, which gives us a new frame-level label for each frame of the original speech, and the DNN can then be trained on these pairs.
The three ingredients of the DNN:
- What is the input? The spectral features.
- What is the output? The HMM-state label of each feature frame.
- What is the loss function? Cross-entropy, as for any classification problem.
Then, just train it hard and you're done.
Decoding
As with the GMM-HMM system: for any feature sequence, the DNN-HMM model gives the observation probability of any state at any time; applying the Viterbi algorithm with the HMM transition probabilities and the observation sequence, we can decode.
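One detail worth making concrete: the DNN outputs state posteriors p(s|x), while the HMM decoder expects observation likelihoods p(x|s). By Bayes' rule (dropping the constant p(x)), dividing the posteriors by the state priors gives a scaled likelihood that can stand in for the GMM score. A small sketch with made-up numbers:

```python
import numpy as np

# Hypothetical DNN frame posteriors over 3 HMM states (each row sums to 1).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
# State priors; in practice these are estimated from the training alignments.
priors = np.array([0.5, 0.3, 0.2])

# Scaled (pseudo-)likelihoods used in place of the GMM scores:
# p(x|s) is proportional to p(s|x) / p(s); work in the log domain for stability.
log_like = np.log(posteriors) - np.log(priors)
```

These scaled log-likelihoods are what the Viterbi decoder combines with the HMM transition probabilities.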
Overall Pipeline
The Kaldi AISHELL recipe: egs/aishell/s5/run.sh
1. Data preparation
# Download and extract the data
local/download_and_untar.sh $data $data_url data_aishell || exit 1;
local/download_and_untar.sh $data $data_url resource_aishell || exit 1;
# Lexicon preparation
local/aishell_prepare_dict.sh $data/resource_aishell || exit 1;
# Data preparation: build the data-directory format Kaldi expects
local/aishell_data_prep.sh $data/data_aishell/wav $data/data_aishell/transcript || exit 1;
# Phone sets, questions, L compilation: prepare the phone set and the question set
utils/prepare_lang.sh --position-dependent-phones false data/local/dict \
"<SPOKEN_NOISE>" data/local/lang data/lang || exit 1;
# LM training (the language model)
local/aishell_train_lms.sh || exit 1;
# G compilation, check LG composition (WFST composition)
utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
data/local/dict/lexicon.txt data/lang_test || exit 1;
2. Feature extraction
# Now make MFCC plus pitch features.
# mfccdir should be some place with a largish disk where you
# want to store MFCC features.
# The features used are MFCC plus pitch (fundamental frequency)
mfccdir=mfcc
for x in train dev test; do
steps/make_mfcc_pitch.sh --cmd "$train_cmd" --nj 10 data/$x exp/make_mfcc/$x $mfccdir || exit 1;
steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir || exit 1;
utils/fix_data_dir.sh data/$x || exit 1;
done
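compute_cmvn_stats.sh above gathers per-dimension mean and variance statistics for cepstral mean and variance normalization (CMVN). Applying them amounts to the following, shown here as a simplified per-utterance NumPy sketch rather than Kaldi's actual implementation:

```python
import numpy as np

def apply_cmvn(feats):
    """Cepstral mean and variance normalization over one utterance:
    make each feature dimension zero-mean and unit-variance.
    feats: (num_frames, feat_dim) feature matrix."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0) + 1e-10  # guard against division by zero
    return (feats - mean) / std
```

In Kaldi the statistics are usually accumulated per speaker rather than per utterance, but the normalization itself is the same.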
3. Monophone training
# Train a monophone model on delta features.
steps/train_mono.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono || exit 1;
# Decode with the monophone model: evaluate it by measuring its recognition accuracy
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph || exit 1;
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/mono/graph data/dev exp/mono/decode_dev
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/mono/graph data/test exp/mono/decode_test
# Get alignments from the monophone system (Viterbi), in preparation for the next training pass
steps/align_si.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/mono exp/mono_ali || exit 1;
4. Triphone training
# Several training passes; each pass is train + decode + align
# Train the first triphone pass model tri1 on delta + delta-delta features.
steps/train_deltas.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/mono_ali exp/tri1 || exit 1;
# decode tri1
utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph || exit 1;
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/tri1/graph data/dev exp/tri1/decode_dev
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/tri1/graph data/test exp/tri1/decode_test
# align tri1
steps/align_si.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/tri1 exp/tri1_ali || exit 1;
# train tri2 [delta+delta-deltas]
steps/train_deltas.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/tri1_ali exp/tri2 || exit 1;
# decode tri2
utils/mkgraph.sh data/lang_test exp/tri2 exp/tri2/graph
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/tri2/graph data/dev exp/tri2/decode_dev
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.config --nj 10 \
exp/tri2/graph data/test exp/tri2/decode_test
# Align training data with the tri2 model.
steps/align_si.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/tri2 exp/tri2_ali || exit 1;
# Train the second triphone pass model tri3a on LDA+MLLT features.
steps/train_lda_mllt.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/tri2_ali exp/tri3a || exit 1;
# Run a test decode with the tri3a model.
utils/mkgraph.sh data/lang_test exp/tri3a exp/tri3a/graph || exit 1;
steps/decode.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri3a/graph data/dev exp/tri3a/decode_dev
steps/decode.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri3a/graph data/test exp/tri3a/decode_test
# align tri3a with fMLLR
steps/align_fmllr.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/tri3a exp/tri3a_ali || exit 1;
# Train the third triphone pass model tri4a on LDA+MLLT+SAT features.
# From now on, we start building a more serious system with Speaker
# Adaptive Training (SAT).
steps/train_sat.sh --cmd "$train_cmd" \
2500 20000 data/train data/lang exp/tri3a_ali exp/tri4a || exit 1;
# decode tri4a
utils/mkgraph.sh data/lang_test exp/tri4a exp/tri4a/graph
steps/decode_fmllr.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri4a/graph data/dev exp/tri4a/decode_dev
steps/decode_fmllr.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri4a/graph data/test exp/tri4a/decode_test
# align tri4a with fMLLR
steps/align_fmllr.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/tri4a exp/tri4a_ali
# Train tri5a, which is LDA+MLLT+SAT
# Building a larger SAT system. You can see the num-leaves is 3500 and tot-gauss is 100000
steps/train_sat.sh --cmd "$train_cmd" \
3500 100000 data/train data/lang exp/tri4a_ali exp/tri5a || exit 1;
# decode tri5a
utils/mkgraph.sh data/lang_test exp/tri5a exp/tri5a/graph || exit 1;
steps/decode_fmllr.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri5a/graph data/dev exp/tri5a/decode_dev || exit 1;
steps/decode_fmllr.sh --cmd "$decode_cmd" --nj 10 --config conf/decode.config \
exp/tri5a/graph data/test exp/tri5a/decode_test || exit 1;
# align tri5a with fMLLR
steps/align_fmllr.sh --cmd "$train_cmd" --nj 10 \
data/train data/lang exp/tri5a exp/tri5a_ali || exit 1;
5. DNN training
# nnet3: the TDNN is a variant of the DNN; it takes tri5a's output (the alignments) as its input
local/nnet3/run_tdnn.sh
# chain
local/chain/run_tdnn.sh
Deep Neural Networks
Speech recognition datasets
Open-source speech data collection: Open Speech and Language Resources
Academic conferences in the speech field
- InterSpeech
- ICASSP
- ASRU
ASR performance metrics
- Sentence Error Rate (SER)
$$
\text { Sentence Error Rate }=100 \times \frac{\# \text { of sentences with at least one word error }}{\text { total } \# \text { of sentences }}
$$
- Word Error Rate (WER)
$$
\text { Word Error Rate }=100 \times \frac{\text { Insertions }+\text { Substitutions }+\text { Deletions }}{\text { Total Words in Correct Transcript }}
$$
  - Total Words is the number of words in the reference, i.e. S (substituted) + D (deleted) + H (correct)
- Character Error Rate (CER)
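WER (and CER, which is the same computation over character sequences instead of word sequences) is computed via the Levenshtein edit distance between the reference and the hypothesis. A minimal dynamic-programming sketch:

```python
def word_error_rate(ref, hyp):
    """WER via Levenshtein distance on word lists:
    100 * (S + D + I) / len(ref). For CER, pass character lists."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i  # all deletions
    for j in range(H + 1):
        d[0][j] = j  # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[R][H] / R
```

For example, one substitution plus one deletion against a six-word reference gives a WER of 33.3%.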
FNN(Feedforward Neural Network)
$$
y_{l}=f\left(\boldsymbol{W}_{l} x+\boldsymbol{b}_{l}\right)
$$
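Stacking this layer equation is the entire forward pass. A small sketch using ReLU as f (shapes, layer count, and names are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fnn_forward(x, layers):
    """Apply y_l = f(W_l x + b_l) layer by layer.
    layers: list of (W, b) pairs, applied in order."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x
```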
Activation functions
f is an activation function; common choices include sigmoid, tanh, and ReLU.
Loss function
The loss function for NN classification:
- Softmax probability normalization
Normalization squashes values into [0, 1], turning them into dimensionless quantities that are easy to compute with and compare. Softmax is designed to pair with cross-entropy: backpropagation applies the chain rule, and when cross-entropy is differentiated through the softmax, the softmax's normalizing denominator cancels, leaving a simple and numerically well-behaved gradient.
$$
y_{k}=\frac{\exp \left(a_{k}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}\right)}
$$
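A direct implementation of this formula, with the standard max-subtraction trick: subtracting max(a) from every logit leaves the result mathematically unchanged but prevents exp() from overflowing on large inputs.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(a - np.max(a))  # shift by the max before exponentiating
    return e / e.sum()
```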
- Cross-Entropy (CE) loss
Relative entropy (KL divergence) measures the difference between two probability distributions, so in principle we should use it to measure the gap between the true distribution and the predicted one. But relative entropy = cross-entropy - entropy, and the entropy term (the information needed to remove the uncertainty of p, the true distribution) is a fixed constant, so minimizing relative entropy reduces to minimizing cross-entropy. This is why machine learning uses cross-entropy as the loss function.
$$
L=-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{n k} \ln \left(y_{n k}\right)
$$
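The softmax/cross-entropy cancellation mentioned above yields a famously simple gradient with respect to the logits: dL/da = y - t. A finite-difference check of that fact (the numbers are arbitrary):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def cross_entropy(y, t):
    return -np.sum(t * np.log(y))

a = np.array([2.0, 1.0, 0.1])      # logits
t = np.array([1.0, 0.0, 0.0])      # one-hot target
y = softmax(a)

# Analytic gradient of CE(softmax(a)) with respect to the logits a.
grad = y - t

# Numerical gradient via central differences, for comparison.
eps = 1e-6
num = np.zeros_like(a)
for k in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[k] += eps
    am[k] -= eps
    num[k] = (cross_entropy(softmax(ap), t)
              - cross_entropy(softmax(am), t)) / (2 * eps)
```

The two gradients agree to numerical precision, which is exactly the simplification the text refers to.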
Gradient descent
$$
\begin{aligned}
&\theta=\left\{W_{1}, b_{1}, W_{2}, b_{2}, \ldots, W_{N}, b_{N}\right\} \\
&\theta^{*}=\theta-\alpha \frac{\partial L}{\partial \theta}, \quad \alpha \text {: learning rate }
\end{aligned}
$$
反向传播(BackPropagation)
$$
\text {Chain rule: if } y=f(x) \text { and } z=g(y), \text { then } \frac{d z}{d x}=\frac{d z}{d y} \frac{d y}{d x}
$$
Training loop
# model: NN model, such as a DNN
# theta: NN parameters
# lr:    learning rate
init_model_with_parameter_theta(model, theta)
for epoch in range(max_epoch):
    for minibatch in data:
        # Get minibatch data: input features and labels
        input, label = minibatch
        output = model.forward(input)
        loss = compute_loss(output, label)
        delta = model.backward(loss)
        theta = theta - lr * delta
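A runnable instance of the pseudocode above, using a plain linear model and mean-squared error on synthetic data (everything here is illustrative and not part of any speech toolkit):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless targets

w = np.zeros(3)                      # theta: model parameters
lr = 0.1                             # learning rate
for epoch in range(50):
    for i in range(0, 100, 20):      # mini-batches of 20
        xb, yb = X[i:i + 20], y[i:i + 20]
        out = xb @ w                              # forward pass
        grad = 2 * xb.T @ (out - yb) / len(xb)    # backward pass (MSE gradient)
        w = w - lr * grad                         # parameter update
```

After training, w recovers the true weights, which is the whole loop in miniature: forward, loss gradient, update.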
Other essential NN topics
- Optimizer(SGD/Momentum/Adam …)
- Dropout
- Regularization
- ResidualConnection
- BatchNormalization
CNN
Convolution (feature extraction) + Pooling (dimensionality reduction)
TDNN
- Convolution along the time axis only; no pooling
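A single TDNN layer is just a 1-D convolution over the time axis: each output frame is a function of a fixed window of consecutive input frames. A sketch, where the context width, dimensions, and ReLU choice are all illustrative:

```python
import numpy as np

def tdnn_layer(x, w, b):
    """One TDNN layer: convolution over time only, no pooling.
    x: (T, D) input frames; w: (C, D, H) weights for a C-frame context;
    b: (H,) bias. Output: (T - C + 1, H) frames."""
    T, D = x.shape
    C, _, H = w.shape
    out = np.empty((T - C + 1, H))
    for t in range(T - C + 1):
        # Each output frame sees C consecutive input frames.
        out[t] = np.tensordot(x[t:t + C], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)  # ReLU
```

Stacking such layers widens the effective temporal context seen by the top of the network, which is what makes the TDNN suited to frame classification.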
RNN(Recurrent Neural Network)
The network structure has memory, which makes it well suited to modeling temporal sequences.
LSTM (Long Short Term Memory)
Plain RNNs are notoriously unstable: gradients tend to vanish or explode, so training often goes poorly.
LSTMP(LSTM with projection)
The main goal of this variant is to reduce the number of model parameters. Compared with the standard LSTM, LSTMP inserts a projection layer between the memory blocks and the model's output; adding this projection layer substantially reduces the parameter count.
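The saving can be made concrete by counting parameters. The sketch below assumes the standard four-gate LSTM equations without peephole connections, and the layer sizes in the example are illustrative:

```python
def lstm_params(n_in, n_cell):
    """Parameters of a standard LSTM layer (4 gates, no peepholes):
    each gate has an input matrix, a recurrent matrix, and a bias."""
    return 4 * (n_cell * n_in + n_cell * n_cell + n_cell)

def lstmp_params(n_in, n_cell, n_proj):
    """LSTMP: the recurrent connection uses the projected output
    (n_proj < n_cell), plus one n_cell -> n_proj projection matrix."""
    return 4 * (n_cell * n_in + n_cell * n_proj + n_cell) + n_cell * n_proj
```

With 512 inputs, 1024 cells, and a 256-dimensional projection, the LSTMP layer has roughly half the parameters of the plain LSTM, because the dominant recurrent matrices shrink from n_cell x n_cell to n_cell x n_proj.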
Hybrid networks (CLDNN)
- FNN
  - Global feature extraction
- CNN
  - Local feature extraction
  - Invariance
  - Limited temporal modeling ability
- RNN
  - Memory
  - Temporal modeling ability
Complex networks are essentially combinations of these three.