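sentencepiece Study Notes
Introduction
I have recently been reading the speechbrain speech recognition project, whose first step is to tokenize the text labels. The many parameters left me confused,
so here is a systematic summary of how to use Google's sentencepiece.
Reference: https://github.com/google/sentencepiece
1. Installation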
pip install sentencepiece
2. Supported segmentation methods
3. Using the Python interface
import sentencepiece as spm

# Model Training
'''
--input: one-sentence-per-line raw corpus file. No need to run a tokenizer, normalizer or preprocessor.
         By default, SentencePiece normalizes the input with Unicode NFKC.
         You can pass a comma-separated list of files.
--model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
--vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
--character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages
         with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
--model_type: model type. Choose from unigram (default), bpe, char, or word.
         The input sentence must be pretokenized when using the word type.
'''

# Handling of special tokens
'''
1. By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens,
   which have the ids 0, 1, and 2 respectively.
2. We can redefine this mapping in the training phase as follows:
   --bos_id=0 --eos_id=1 --unk_id=5
3. Setting an id to -1, e.g. bos_id=-1, disables that special token.
   Note that the unknown id cannot be disabled.
   We can define an id for padding (<pad>) as --pad_id=3.
'''

spm.SentencePieceTrainer.Train(input='botchan.txt', model_prefix='m', model_type="unigram", vocab_size=1000)
# This generates m.model and m.vocab in the current directory

# Load the trained model and segment text
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode: text -> id
result = sp.encode(['This is a test', 'Hello world'], out_type=int)
print(result)
# Encode: text -> pieces
result = sp.encode(['This is a test', 'Hello world'], out_type=str)
print(result)

# Decode: id -> text
result = sp.decode([285, 46, 10, 170, 382])
print(result)
# Decode: pieces -> text
result = sp.decode(['▁This', '▁is', '▁a', '▁t', 'est'])
print(result)

# Sampling: each call may return a different segmentation
for _ in range(10):
    result = sp.encode('This is a test', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
    print(result)

# Other commonly used methods
sp.get_piece_size()                     # vocabulary size
sp.id_to_piece(2)                       # id -> piece
sp.id_to_piece([2, 3, 4])
sp.piece_to_id('<s>')                   # piece -> id
sp.piece_to_id(['</s>', '\r', '▁'])
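The pieces in the listing above carry a "▁" (U+2581) prefix, which is how SentencePiece marks whitespace so that decoding can restore the original spacing. Below is a minimal pure-Python sketch of that convention, written only to make the piece-to-text mapping concrete. The helper names pieces_to_text and text_to_marked are my own (hypothetical) and are not part of the sentencepiece API. The sketch also skips what the real library does around this step, such as NFKC normalization and whitespace cleanup.

```python
# Pure-Python illustration of the whitespace marker; sentencepiece is not needed.
WS = "\u2581"  # "▁", the meta symbol SentencePiece uses in place of a space


def text_to_marked(text):
    """Hypothetical helper: add a leading marker and replace every space with it."""
    return WS + text.replace(" ", WS)


def pieces_to_text(pieces):
    """Hypothetical helper: join pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")


pieces = ["▁This", "▁is", "▁a", "▁t", "est"]
print(pieces_to_text(pieces))         # This is a test
print(text_to_marked("Hello world"))  # ▁Hello▁world
```

Because the space is kept as an ordinary symbol, joining the pieces and replacing the marker is enough to get the text back, with no language-specific detokenizer involved.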