FunASR
https://www.funasr.com/#/
https://github.com/modelscope/FunASR
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
FunASR hopes to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers conduct research on and production of speech recognition models more conveniently, and promotes the development of the speech recognition ecosystem. ASR for Fun!
- FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR. FunASR provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
- We have released a vast collection of academic and industrial pretrained models on ModelScope and Hugging Face, which can be accessed through our Model Zoo. The representative Paraformer-large, a non-autoregressive end-to-end speech recognition model, has the advantages of high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services. For more details on service deployment, please refer to the service deployment document.
https://cloud.baidu.com/article/3347080
3. Multilingual support
As globalization advances, multilingual support has become an essential feature of speech recognition technology. FunASR supports Chinese, English, Japanese, and other major languages, and can be customized to meet the speech recognition needs of different countries and regions. This gives FunASR broad application prospects in multinational enterprises, international communication, and similar settings.
demo
https://www.funasr.com/static/offline/index.html
tutorials
https://www.cnblogs.com/LaoDie1/p/18183024
https://www.cnblogs.com/v3ucn/p/17956926
VAD
https://blog.ailemon.net/2021/02/18/introduction-to-vad-theory/
First, let's clarify the basic concept: a voice activity detection (VAD, sometimes "voice activation detection") algorithm detects whether the current audio signal contains human speech. By classifying the input signal, it separates speech segments from the various background-noise segments, so that the two kinds of signal can each be handled with different processing methods.
There are many feature-extraction methods for VAD. The simplest and most direct is energy-based: measure the short-time energy (STE) and the short-time zero-crossing rate (ZCC, zero cross counter). Short-time energy is the energy of one frame of the speech signal, and the zero-crossing rate is the number of times the time-domain signal within a frame crosses zero (the time axis). In general, a high-accuracy VAD combines multiple features for its decision: energy-based features, frequency-domain features, cepstral features, harmonic features, long-term features, and so on [1]. Finally, the speech/non-speech decision is made by comparing against thresholds, or by using statistical or machine-learning methods.