PaddleSpeech ASR Tutorial

Installation
Quick start
Command reference
The server configuration file application.yaml
- Server configuration
- Engine configuration
  - asr_python parameters
  - asr_online parameters

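I tried the models that ship with PaddleSpeech and the results were very good, but the project lacks an easy-to-follow tutorial. So I wrote one and am sharing it, in the hope that more people will use the tool.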

Installation

Installing the Paddle framework

conda install paddlepaddle==2.3.0 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

Installing from a package index (pip)

pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple

Installing from source

git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple

Quick start

Download the test audio

wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav 
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav

Non-streaming command-line interface (CLI)

Using the default model

paddlespeech asr --input zh.wav

Specifying a model

paddlespeech asr --model conformer_online_wenetspeech --input zh.wav
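
The two invocations above can also be driven from a script. The following is a minimal sketch that only assembles the command line from the --input and --model flags shown above. The helper names build_asr_command and run_asr are hypothetical, and actually running the command assumes paddlespeech is installed and on the PATH.

```python
import subprocess
from typing import List, Optional


def build_asr_command(input_path: str, model: Optional[str] = None) -> List[str]:
    """Assemble the `paddlespeech asr` command line (hypothetical helper)."""
    cmd = ["paddlespeech", "asr"]
    if model is not None:
        # Pass --model only when overriding the default model.
        cmd += ["--model", model]
    cmd += ["--input", input_path]
    return cmd


def run_asr(input_path: str, model: Optional[str] = None) -> str:
    """Run the CLI and return whatever it prints to stdout."""
    result = subprocess.run(
        build_asr_command(input_path, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI fails
    )
    return result.stdout.strip()
```

Calling run_asr("zh.wav") corresponds to the default-model command, and run_asr("zh.wav", model="conformer_online_wenetspeech") to the second one.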

Non-streaming server

Change into the speech_server directory

cd PaddleSpeech/demos/speech_server

Start the service

paddlespeech_server start --config_file ./conf/application.yaml

Access it with the client program

paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav

Streaming server

Change into the streaming_asr_server directory

cd PaddleSpeech/demos/streaming_asr_server

Start the service

paddlespeech_server start --config_file ./conf/ws_conformer_wenetspeech_application.yaml

Access it with the client program

paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav

Command reference

Print the commands supported by paddlespeech_server

paddlespeech_server help

Output:

Usage:
paddlespeech_server <command> <options>

Commands:
help                   Show help for commands.
start                  Start the service
stats                  Get the models supported by each speech task in the service.

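Starting the service with paddlespeech_server

Only two options are configurable, --config_file and --log_file. They specify the configuration the server loads and where it writes its log, respectively.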

paddlespeech_server start --config_file <path> --log_file <path>

Listing the supported pretrained models with paddlespeech_server

Use the --task option to list the pretrained models supported for a given task

paddlespeech_server stats --task asr

Output:

Here is the table of ASR pretrained models supported in the service.
+--------------------------------+----------+-------------+
|             Model              | Language | Sample Rate |
+--------------------------------+----------+-------------+
|     conformer_wenetspeech      |    zh    |     16k     |
|  conformer_online_wenetspeech  |    zh    |     16k     |
|    conformer_online_multicn    |    zh    |     16k     |
|       conformer_aishell        |    zh    |     16k     |
|    conformer_online_aishell    |    zh    |     16k     |
|    transformer_librispeech     |    en    |     16k     |
| deepspeech2online_wenetspeech  |    zh    |     16k     |
|   deepspeech2offline_aishell   |    zh    |     16k     |
|   deepspeech2online_aishell    |    zh    |     16k     |
| deepspeech2offline_librispeech |    en    |     16k     |
+--------------------------------+----------+-------------+
Here is the table of ASR static pretrained models supported in the service.
+----------------------------+----------+-------------+
|           Model            | Language | Sample Rate |
+----------------------------+----------+-------------+
| deepspeech2offline_aishell |    zh    |     16k     |
+----------------------------+----------+-------------+

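As the output shows, the ASR task supports both dynamic-graph and static-graph models. However, deepspeech2offline_aishell is listed as both dynamic and static, which is ambiguous, and the two entries have the same md5 checksum. Working out how they are told apart requires a closer reading of the source code.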

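The server configuration file application.yaml

The configuration file application.yaml has two parts: the server configuration and the engine configuration.

Server configuration

The server configuration looks like this: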

host: 0.0.0.0
port: 8090
protocol: 'websocket'
engine_list: ['asr_online']

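host and port are the IP address and port number of the server.
protocol is the protocol the server speaks. Only http and websocket are supported.
- http is the non-streaming speech service and supports the asr_python and asr_inference engines. It calls the same speech recognition engine as the CLI, so it is effectively a server interface to the CLI mode.
  - http://localhost:8090/paddlespeech/asr/help supports the GET method and only returns some basic usage information.
  - http://localhost:8090/paddlespeech/asr supports the POST method. The client posts the complete audio to the server, the server runs the speech recognition engine on it, and then returns the result.
- websocket is the streaming speech service and supports the asr_online engine.
  - ws://localhost:8090/paddlespeech/asr/streaming is the WebSocket streaming endpoint. It returns recognition results in real time as the audio stream arrives.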
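
The non-streaming POST exchange described above can be sketched with the Python standard library. This is a hedged illustration, not the client's actual implementation: the JSON field names (audio, audio_format, sample_rate, lang), the base64 encoding of the audio, and the "zh_cn" language value are assumptions about the request schema that this tutorial does not confirm, and the helper names are hypothetical. Only the URL comes from the text above.

```python
import base64
import json
import urllib.request


def build_asr_payload(audio_bytes: bytes, audio_format: str = "wav",
                      sample_rate: int = 16000, lang: str = "zh_cn") -> dict:
    """Build the request body. Field names are ASSUMED, not confirmed here."""
    return {
        # The whole file is sent at once, which is what makes this non-streaming.
        "audio": base64.b64encode(audio_bytes).decode("utf-8"),
        "audio_format": audio_format,
        "sample_rate": sample_rate,
        "lang": lang,
    }


def post_asr(wav_path: str,
             url: str = "http://localhost:8090/paddlespeech/asr") -> dict:
    """POST a complete audio file and return the decoded JSON response."""
    with open(wav_path, "rb") as f:
        payload = build_asr_payload(f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running under the http protocol, post_asr("./zh.wav") would send the file in one request. If the real schema differs, only build_asr_payload needs to change.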
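engine_list lists the engines this server runs. For ASR, an entry can be one of asr_python, asr_inference, or asr_online, subject to the streaming or non-streaming restriction of the chosen protocol. Because engine_list is a list, it can configure several engines, so the asr, tts, cls, text, and vector services can run at the same time.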

Protocol    Supported engines
http        asr_python, asr_inference, tts_inference, tts_python, cls_python, cls_inference, text_python, vector_python
websocket   asr_online, tts_online, tts_online-onnx
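
The protocol and engine restriction in the table can be checked before starting the server. The sketch below copies the table into a dictionary. The function name check_engine_list is hypothetical and is not part of PaddleSpeech.

```python
# Protocol -> engines, copied from the table above.
SUPPORTED_ENGINES = {
    "http": {
        "asr_python", "asr_inference", "tts_inference", "tts_python",
        "cls_python", "cls_inference", "text_python", "vector_python",
    },
    "websocket": {"asr_online", "tts_online", "tts_online-onnx"},
}


def check_engine_list(protocol: str, engine_list: list) -> list:
    """Return the entries of engine_list that the protocol does not support."""
    if protocol not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    allowed = SUPPORTED_ENGINES[protocol]
    return [engine for engine in engine_list if engine not in allowed]
```

An empty return value means the combination is consistent. For example, check_engine_list("http", ["asr_python", "asr_online"]) reports asr_online as the mismatch.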

Engine configuration

asr_python parameters

asr_python:
    model: 'conformer_wenetspeech'
    lang: 'zh'
    sample_rate: 16000
    cfg_path: # [optional]
    ckpt_path: # [optional]
    decode_method: 'attention_rescoring'
    force_yes: True
    device:  # set 'gpu:id' or 'cpu'

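cfg_path is the model's configuration file, and ckpt_path is the pretrained model's parameters.
When either cfg_path or ckpt_path is null, one of PaddleSpeech's bundled pretrained models is used. The model is chosen from the pretrained model library by the three parameters model, lang, and sample_rate together. cfg_path then resolves to that model's model.yaml and ckpt_path to its parameter file *.pdparams.
When cfg_path and ckpt_path are both non-null, cfg_path points to the model.yaml of your own trained model and ckpt_path to its trained parameters. In that case model, lang, and sample_rate have no effect.
decode_method is harder to explain. After the input audio passes through the model's encoder, the model outputs, for each frame, a probability for every candidate character. decode_method is how those per-frame probabilities are turned into the final text.
- attention: the autoregressive decoding of the transformer decoder. It cannot be explained in a few words here, so it is left for a later discussion of the model.
- ctc_greedy_search: greedy decoding. Take the most probable character for each frame, then apply the CTC rule of merging adjacent identical outputs and dropping blanks to get the final result.
- ctc_prefix_beam_search: beam search. It is somewhat like dynamic programming, and somewhat like decoding a hidden Markov model. Where greedy decoding keeps only the single most probable character, beam search keeps the 10 most probable (the number is configurable). Combining characters produces a large number of branches, so each branch is scored, the 10 highest-scoring branches are kept, and the rest are pruned. This keeps the number of branches bounded and lets the decoder choose the best whole sentence from the surviving candidates.
- attention_rescoring: a combination of ctc_prefix_beam_search and the attention mechanism. ctc_prefix_beam_search first produces the 10 highest-scoring candidate branches, the transformer decoder then rescores them, and the best one becomes the final result.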
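
To make the two CTC methods above concrete, here is a toy sketch in plain Python. It is not PaddleSpeech's implementation. It assumes each frame is a list of linear probabilities over integer token ids, with id 0 as the CTC blank, and the default beam_size of 10 mirrors the number mentioned above.

```python
from collections import defaultdict


def ctc_greedy_search(frame_probs, blank=0):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    output, previous = [], None
    for token in best:
        if token != previous and token != blank:
            output.append(token)
        previous = token
    return output


def ctc_prefix_beam_search(frame_probs, beam_size=10, blank=0):
    """Return up to beam_size (prefix, probability) pairs, best first."""
    # For each prefix, track the probability of alignments that end in a
    # blank and of alignments that end in a non-blank token.
    beams = {(): (1.0, 0.0)}
    for probs in frame_probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for token, p in enumerate(probs):
                if token == blank:
                    candidates[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and token == prefix[-1]:
                    # A repeated token merges into the same prefix ...
                    candidates[prefix][1] += p_non_blank * p
                    # ... unless a blank separated the two occurrences.
                    candidates[prefix + (token,)][1] += p_blank * p
                else:
                    candidates[prefix + (token,)][1] += (p_blank + p_non_blank) * p
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])  # prune to the best branches
    return [(list(prefix), p_b + p_nb) for prefix, (p_b, p_nb) in beams.items()]
```

For two frames that each give [0.6, 0.4] over (blank, token 1), greedy search returns an empty sequence because blank wins every frame. Prefix beam search ranks [1] first with total probability 0.64, since three different alignments collapse to it. The n-best list it returns is the kind of candidate set that attention_rescoring would then rescore with the decoder.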
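force_yes: when the sample rate of the input audio differs from the model's, the audio has to be forcibly resampled so that its sample rate matches the model's.
device: whether the model runs on the CPU or on a GPU.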
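
The cfg_path and ckpt_path rule described above can be written as a small decision function. This is only a sketch of the rule as stated in this section, with a hypothetical function name. It takes the asr_python block as an already-parsed dictionary, in which an empty YAML value such as "cfg_path:" appears as None.

```python
def resolve_asr_python_model(conf: dict) -> dict:
    """Decide whether a bundled pretrained model or a custom one is used."""
    cfg_path = conf.get("cfg_path")
    ckpt_path = conf.get("ckpt_path")
    if cfg_path is None or ckpt_path is None:
        # Either path missing: select a bundled pretrained model by
        # (model, lang, sample_rate).
        return {
            "source": "pretrained",
            "key": (conf["model"], conf["lang"], conf["sample_rate"]),
        }
    # Both paths given: model, lang and sample_rate no longer matter.
    return {"source": "custom", "cfg_path": cfg_path, "ckpt_path": ckpt_path}
```

With the sample configuration above, where both paths are empty, the result is the pretrained entry keyed by ('conformer_wenetspeech', 'zh', 16000).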

asr_online parameters

asr_online:
    model_type: 'conformer_online_wenetspeech'
    am_model:            # the pdmodel file of am static model [optional]
    am_params:           # the pdiparams file of am static model [optional]
    lang: 'zh'
    sample_rate: 16000
    cfg_path:
    force_yes: True
    device: 'cpu'        # cpu or gpu:id
    decode_method: "attention_rescoring"
    am_predictor_conf:
        device:          # set 'gpu:id' or 'cpu'
        switch_ir_optim: True
        glog_info: False # True -> print glog
        summary: True    # False -> do not show predictor config

    chunk_buffer_conf:
        window_n: 7      # frame
        shift_n: 4       # frame
        window_ms: 25    # ms
        shift_ms: 10     # ms
        sample_rate: 16000
        sample_width: 2

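This tutorial focuses on the conformer_online_wenetspeech model. The configuration file also contains parameters shared with the deepspeech2 models, which are not covered in detail here.

cfg_path is the model's configuration file. For a static model, am_model and am_params are the model's configuration and the model's parameters, respectively (the pdmodel and pdiparams files mentioned in the comments above). For a dynamic model, am_model and am_params serve as the model parameters, much like ckpt_path in the previous section. conformer_online_wenetspeech is a dynamic model.
When cfg_path, am_model, or am_params is null, one of PaddleSpeech's bundled pretrained models is used. The model is chosen from the pretrained model library by the three parameters model_type, lang, and sample_rate together. cfg_path then resolves to that model's model.yaml, and am_model and am_params to its parameter file *.pdparams.
decode_method was introduced under asr_python, but here only ctc_prefix_beam_search and attention_rescoring are supported. The default is attention_rescoring, which gives the best results.
force_yes and device work the same as in asr_python and are not repeated here.
am_predictor_conf is used by the deepspeech2 models. The conformer models do not use it.
chunk_buffer_conf appears to be a leftover from an earlier version: nothing in the code references it 😂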
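
Even though chunk_buffer_conf is unused, its field names suggest the usual framing arithmetic. The sketch below is one plausible reading of those names, not something confirmed by the code. It assumes a chunk of window_n frames spans (window_n - 1) frame shifts plus one full frame window, that consecutive chunks advance by shift_n frame shifts, and that sample_width is the number of bytes per sample.

```python
def chunk_sizes(window_n=7, shift_n=4, window_ms=25, shift_ms=10,
                sample_rate=16000, sample_width=2):
    """Chunk and hop sizes under the ASSUMED reading described above."""
    chunk_ms = (window_n - 1) * shift_ms + window_ms  # span of window_n frames
    hop_ms = shift_n * shift_ms                       # advance between chunks

    def to_samples(ms):
        return ms * sample_rate // 1000

    return {
        "chunk_ms": chunk_ms,
        "hop_ms": hop_ms,
        "chunk_samples": to_samples(chunk_ms),
        "hop_samples": to_samples(hop_ms),
        "chunk_bytes": to_samples(chunk_ms) * sample_width,
        "hop_bytes": to_samples(hop_ms) * sample_width,
    }
```

Under these assumptions the default values give an 85 ms chunk (1360 samples, 2720 bytes) that advances by 40 ms (640 samples, 1280 bytes).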

Posted 2022-05-22 09:07 by chenkui164
