pytorch学习笔记4
pytorch学习笔记4
tokenizer【分词器】类的结构和数据
主要内容是用于分词的字符,包含一个字典,字典中包含一些特殊字符;
['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']常用的特殊字符一般是这5个,分表表示[未知字符],[分隔符],[填充符],[分类符],[掩码符];
Bert模型系列 vocab_file和tokenizer_file及其下载地址
vocab_file
bert-base-uncased:https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
bert-large-uncased:https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt
bert-base-cased:https://huggingface.co/bert-base-cased/resolve/main/vocab.txt
bert-large-cased:https://huggingface.co/bert-large-cased/resolve/main/vocab.txt
bert-base-multilingual-uncased:https://huggingface.co/bert-base-multilingual-uncased/resolve/main/vocab.txt
bert-base-multilingual-cased:https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt
bert-base-chinese:https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt
bert-base-german-cased:https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt
bert-large-uncased-whole-word-masking:https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt
bert-large-cased-whole-word-masking:https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/vocab.txt
bert-large-uncased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt
bert-large-cased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt
bert-base-cased-finetuned-mrpc:https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt
bert-base-german-dbmdz-cased:https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/vocab.txt
bert-base-german-dbmdz-uncased:https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt
TurkuNLP/bert-base-finnish-cased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/vocab.txt
TurkuNLP/bert-base-finnish-uncased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/vocab.txt
wietsedv/bert-base-dutch-cased:https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/vocab.txt
tokenizer_file
bert-base-uncased:https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json
bert-large-uncased:https://huggingface.co/bert-large-uncased/resolve/main/tokenizer.json
bert-base-cased:https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json
bert-large-cased:https://huggingface.co/bert-large-cased/resolve/main/tokenizer.json
bert-base-multilingual-uncased:https://huggingface.co/bert-base-multilingual-uncased/resolve/main/tokenizer.json
bert-base-multilingual-cased:https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json
bert-base-chinese:https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json
bert-base-german-cased:https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-tokenizer.json
bert-large-uncased-whole-word-masking:https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/tokenizer.json
bert-large-cased-whole-word-masking:https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/tokenizer.json
bert-large-uncased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json
bert-large-cased-whole-word-masking-finetuned-squad:https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json
bert-base-cased-finetuned-mrpc:https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/tokenizer.json
bert-base-german-dbmdz-cased:https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/tokenizer.json
bert-base-german-dbmdz-uncased:https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/tokenizer.json
TurkuNLP/bert-base-finnish-cased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/tokenizer.json
TurkuNLP/bert-base-finnish-uncased-v1:https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/tokenizer.json
wietsedv/bert-base-dutch-cased:https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/tokenizer.json
#end