HuggingFace
Pipeline
The pipeline module wraps the whole workflow up for you: pass in the raw input and you get the final output directly.
Example: fill-mask. As shown below, the pipeline function returns 5 candidate answers.
```python
from transformers import pipeline

classifier = pipeline("fill-mask")
y_pred = classifier("I love <mask> very much.")
print(y_pred)
"""
[
    {'score': 0.09382506459951401, 'token': 123, 'token_str': ' him', 'sequence': 'I love him very much.'},
    {'score': 0.06408175826072693, 'token': 47, 'token_str': ' you', 'sequence': 'I love you very much.'},
    {'score': 0.056255027651786804, 'token': 69, 'token_str': ' her', 'sequence': 'I love her very much.'},
    {'score': 0.017606642097234726, 'token': 106, 'token_str': ' them', 'sequence': 'I love them very much.'},
    {'score': 0.016162296757102013, 'token': 24, 'token_str': ' it', 'sequence': 'I love it very much.'}
]
"""
```
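Since no checkpoint was given, the task's default model is downloaded. A pipeline can also be pinned to an explicit model, and you can control how many candidates come back. A minimal sketch, assuming distilroberta-base (the usual fill-mask default) as the checkpoint and a transformers version where the parameter is named top_k:

```python
from transformers import pipeline

# Pin the pipeline to an explicit checkpoint instead of the task default
# (distilroberta-base is an assumption; any fill-mask model on the Hub works).
classifier = pipeline("fill-mask", model="distilroberta-base")

# top_k limits how many candidates are returned (5 by default).
y_pred = classifier("I love <mask> very much.", top_k=2)
print(y_pred)
```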
Tokenizer
The tokenizer preprocesses the input text and may split a word into subwords (e.g., dogs into dog + s).
In general, the tokenizer's output has to match the model that comes after it (obviously, different models use different splitting schemes).
The result usually has two fields, input_ids and attention_mask: input_ids are the ids of the split tokens in the vocabulary, and a 0 in attention_mask marks a position with no content (i.e., a padded position).
```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting this for a lifetime!",
    "I love Tom Brady.",
]
tokenized_inputs = tokenizer(raw_inputs, padding=True)
print(tokenized_inputs)
"""
{
    'input_ids': [
        [101, 1045, 1005, 2310, 2042, 3403, 2023, 2005, 1037, 6480, 999, 102],
        [101, 1045, 2293, 3419, 10184, 1012, 102, 0, 0, 0, 0, 0]
    ],
    'attention_mask': [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    ]
}
"""
```
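To see the splitting itself, you can tokenize a word directly and map ids back to token strings. A minimal sketch reusing the checkpoint above (the exact split depends on the vocabulary; WordPiece marks word-internal pieces with ##):

```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# A word missing from the vocabulary gets split into subword pieces.
print(tokenizer.tokenize("tokenization"))
# e.g. ['token', '##ization']

# Round-trip: map input_ids back to the token strings they stand for.
ids = tokenizer("I love Tom Brady.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'i', 'love', 'tom', 'brady', '.', '[SEP]']
```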
Model
Once the input sentences have been preprocessed by the tokenizer, they can be fed to the model (the actual large model).
The model's output is a raw vector that has not been through normalization/an activation function, so to get the final result you still have to write that step yourself.
```python
from transformers import AutoTokenizer
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting this for a lifetime!",
    "I love Tom Brady.",
]
tokenized_inputs = tokenizer(raw_inputs, padding=True, return_tensors="pt")

model = AutoModel.from_pretrained(checkpoint)
outputs = model(**tokenized_inputs)

print(outputs.last_hidden_state.shape)
# torch.Size([2, 12, 768])
```
For example, for sentence classification (this checkpoint is fine-tuned for SST-2 sentiment), you can load the model with its classification head and apply a softmax to the logits:
```python
import torch
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting this for a lifetime!",
    "I love Tom Brady.",
]
tokenized_inputs = tokenizer(raw_inputs, padding=True, return_tensors="pt")

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**tokenized_inputs)
print(outputs.logits)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
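To read those probabilities as labels, the model config carries the class-index-to-label mapping. A minimal sketch continuing from the variables above:

```python
# model.config.id2label maps class index -> label string
# (for this checkpoint: {0: 'NEGATIVE', 1: 'POSITIVE'}).
label_ids = predictions.argmax(dim=-1)
for text, idx in zip(raw_inputs, label_ids):
    print(f"{text} -> {model.config.id2label[idx.item()]}")
```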