Tokenizers supported by torchtext

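torchtext is the text-processing library from the PyTorch project. It is installed as its own package rather than shipped inside torch itself.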

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
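To give a feel for what a tokenizer such as basic_english returns, here is a rough pure-Python approximation: it lowercases the text and splits punctuation off into separate tokens. The function name basic_english_sketch and the punctuation set are assumptions made for illustration; this is not torchtext's own code, and its real rule set differs in the details.

```python
import re

# Punctuation that this sketch splits off as separate tokens.
# The exact set is an assumption, not torchtext's rule list.
_PUNCT = re.compile(r"([.,!?()])")


def basic_english_sketch(line):
    """Lowercase the text, pad punctuation with spaces, then split on whitespace."""
    line = line.lower()
    line = _PUNCT.sub(r" \1 ", line)
    return line.split()


tokens = basic_english_sketch("You can now install TorchText using pip!")
print(tokens)
```

The returned object is just a function from a string to a list of strings, which is the same shape of callable that get_tokenizer hands back.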

The definition of get_tokenizer can be found in /Users/xuehuiping/anaconda3/envs/my_transformer/lib/python3.7/site-packages/torchtext/data/utils.py:

def get_tokenizer(tokenizer, language='en')
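The path above is specific to one machine. A small sketch for finding the same file in your own environment, using only the standard library, could look like this. The helper name locate_module_file is hypothetical.

```python
import importlib.util


def locate_module_file(module_name):
    """Return the path of the file defining module_name, or None if unavailable."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, OSError):
        # find_spec imports parent packages first; a missing parent raises
        # ModuleNotFoundError, and a broken native extension may raise OSError.
        return None
    return spec.origin if spec is not None else None


# Prints the path to utils.py if torchtext is installed, otherwise None.
print(locate_module_file("torchtext.data.utils"))
```

Opening the printed file in an editor shows the full body of get_tokenizer for the torchtext version you actually have installed.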

tokenizer can be one of:

tokenizer value   What get_tokenizer does
None              no special tokenization; falls back to plain str.split()
basic_english     language must be 'en'
spacy             spacy = spacy.load(language)
moses             from sacremoses import MosesTokenizer
                  moses_tokenizer = MosesTokenizer()
                  return moses_tokenizer.tokenize
toktok            from nltk.tokenize.toktok import ToktokTokenizer
                  toktok = ToktokTokenizer()
                  return toktok.tokenize
revtok            import revtok
                  return revtok.tokenize
subword           import revtok
                  return partial(revtok.tokenize, decap=True)
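The pattern behind these branches is a dispatcher that imports each third-party backend only when it is requested. Below is a simplified sketch of that idea, not the library's actual function: the name pick_tokenizer, the callable pass-through branch, and the error message are assumptions made for illustration, and the spacy and basic_english branches are left out to keep it short.

```python
from functools import partial


def pick_tokenizer(tokenizer):
    """Hypothetical, simplified dispatcher in the spirit of get_tokenizer."""
    if tokenizer is None:
        return str.split  # plain whitespace splitting
    if callable(tokenizer):
        return tokenizer  # a user-supplied function is used as-is
    if tokenizer == "moses":
        # Imported lazily, so sacremoses is only needed when requested.
        from sacremoses import MosesTokenizer
        return MosesTokenizer().tokenize
    if tokenizer == "toktok":
        from nltk.tokenize.toktok import ToktokTokenizer
        return ToktokTokenizer().tokenize
    if tokenizer == "revtok":
        import revtok
        return revtok.tokenize
    if tokenizer == "subword":
        import revtok
        return partial(revtok.tokenize, decap=True)
    raise ValueError(f"unsupported tokenizer: {tokenizer!r}")


whitespace = pick_tokenizer(None)
print(whitespace("a plain sentence"))
```

Because the imports sit inside the branches, a missing optional package only causes an error when that particular tokenizer is asked for.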
posted on 2021-09-27 15:36 by 宋岳庭