张德长

导航

< 2025年3月 >
23 24 25 26 27 28 1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31 1 2 3 4 5

统计

pytorch学习笔记4

pytorch学习笔记4

 

 

 

 

tokenizer【分词器】类的结构和数据

主要内容是用于分词的字符,包含一个字典,字典中包含一些特殊字符;

['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']常用的特殊字符一般是这5个,分表表示[未知字符],[分隔符],[填充符],[分类符],[掩码符];

Bert模型系列 vocab_file和tokenizer_file及其下载地址

vocab_file
bert-base-uncased:https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
bert-large-uncased:https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt
bert-base-cased:https://huggingface.co/bert-base-cased/resolve/main/vocab.txt
bert-large-cased:https://huggingface.co/bert-large-cased/resolve/main/vocab.txt
bert-base-multilingual-uncased:https://huggingface.co/bert-base-multilingual-uncased/resolve/main/vocab.txt
bert-base-multilingual-cased:https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt
bert-base-chinese:https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt
bert-base-german-cased:https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt
bert-large-uncased-whole-word-masking:https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt
bert-large-cased-whole-word-masking:https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/vocab.txt
bert-large-uncased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt
bert-large-cased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt
bert-base-cased-finetuned-mrpc:https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt
bert-base-german-dbmdz-cased:https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/vocab.txt
bert-base-german-dbmdz-uncased:https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt
TurkuNLP/bert-base-finnish-cased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/vocab.txt
TurkuNLP/bert-base-finnish-uncased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/vocab.txt
wietsedv/bert-base-dutch-cased:https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/vocab.txt
tokenizer_file
bert-base-uncased:https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json
bert-large-uncased:https://huggingface.co/bert-large-uncased/resolve/main/tokenizer.json
bert-base-cased:https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json
bert-large-cased:https://huggingface.co/bert-large-cased/resolve/main/tokenizer.json
bert-base-multilingual-uncased:https://huggingface.co/bert-base-multilingual-uncased/resolve/main/tokenizer.json
bert-base-multilingual-cased:https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json
bert-base-chinese:https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json
bert-base-german-cased:https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-tokenizer.json
bert-large-uncased-whole-word-masking:https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/tokenizer.json
bert-large-cased-whole-word-masking:https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/tokenizer.json
bert-large-uncased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json
bert-large-cased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json
bert-base-cased-finetuned-mrpc:https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/tokenizer.json
bert-base-german-dbmdz-cased:https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/tokenizer.json
bert-base-german-dbmdz-uncased:https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/tokenizer.json
TurkuNLP/bert-base-finnish-cased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/tokenizer.json
TurkuNLP/bert-base-finnish-uncased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/tokenizer.json
wietsedv/bert-base-dutch-cased:https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/tokenizer.json

#end

posted on   张德长  阅读(79)  评论(0编辑  收藏  举报

相关博文:
阅读排行:
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 单元测试从入门到精通
· 上周热点回顾(3.3-3.9)
· winform 绘制太阳,地球,月球 运作规律
点击右上角即可分享
微信分享提示