Transformers 4.37 Chinese Documentation (Part 3)

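Original: huggingface.co/docs/transformers

Zero-shot image classification

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/zero_shot_image_classification

Zero-shot image classification is the task of classifying images into categories using a model that was not explicitly trained on labeled examples from those specific categories.

Traditionally, image classification requires training a model on a specific set of labeled images, and the model learns to "map" certain image features to labels. When such a model has to be used for a classification task that introduces a new set of labels, fine-tuning is needed to "recalibrate" it.

In contrast, zero-shot or open-vocabulary image classification models are typically multimodal models that have been trained on a large dataset of images and associated descriptions. These models learn aligned vision-language representations that can be used for many downstream tasks, including zero-shot image classification.

This is a more flexible approach to image classification: it lets models generalize to new and unseen categories without additional training data, and it lets users query images with free-form text descriptions of their target objects.

In this guide you'll learn how to:

Create a zero-shot image classification pipeline
Run zero-shot image classification inference by hand

Before you begin, make sure you have all the necessary libraries installed: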

pip install -q transformers

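Zero-shot image classification pipeline

The simplest way to try out inference with a model that supports zero-shot image classification is to use the corresponding pipeline(). Instantiate a pipeline from a checkpoint on the Hugging Face Hub: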

>>> from transformers import pipeline

>>> checkpoint = "openai/clip-vit-large-patch14"
>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification")

Next, choose an image you'd like to classify.

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

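Photo of an owl

Pass the image and the candidate object labels to the pipeline. Here we pass the image directly; other suitable options include a local path to an image or an image URL. The candidate labels can be simple words as in this example, or more descriptive.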

>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
>>> predictions
[{'score': 0.9996670484542847, 'label': 'owl'},
 {'score': 0.000199399160919711, 'label': 'seagull'},
 {'score': 7.392891711788252e-05, 'label': 'fox'},
 {'score': 5.96074532950297e-05, 'label': 'bear'}]
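
The remark that labels can be "more descriptive" is easier to see once you know that each candidate label is wrapped in a sentence before it reaches the text encoder. The sketch below only illustrates that string templating step. The template text is an assumption modeled on the pipeline's `hypothesis_template` argument, so check the pipeline documentation for the exact default.

```python
# Candidate labels are turned into short sentences before being tokenized.
labels = ["fox", "bear", "seagull", "owl"]

# Assumed template text; the real pipeline exposes this as `hypothesis_template`.
template = "This is a photo of {}."

prompts = [template.format(label) for label in labels]
for prompt in prompts:
    print(prompt)
```

With the real pipeline, a custom template would be passed alongside the labels, for example `detector(image, candidate_labels=labels, hypothesis_template="a photo of a {}")`.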

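Zero-shot image classification by hand

Now that you've seen how to use the zero-shot image classification pipeline, let's look at how to run zero-shot image classification manually.

Start by loading the model and the associated processor from a checkpoint on the Hugging Face Hub. Here we'll use the same checkpoint as before: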

>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

Let's switch to a different image.

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

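Photo of a car

Use the processor to prepare the inputs for the model. The processor combines an image processor, which prepares the image for the model by resizing and normalizing it, and a tokenizer, which takes care of the text inputs.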

>>> candidate_labels = ["tree", "car", "bike", "cat"]
>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)

Pass the inputs through the model and post-process the results:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits_per_image[0]
>>> probs = logits.softmax(dim=-1).numpy()
>>> scores = probs.tolist()

>>> result = [
...     {"score": score, "label": candidate_label}
...     for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])
... ]

>>> result
[{'score': 0.998572, 'label': 'car'},
 {'score': 0.0010570387, 'label': 'bike'},
 {'score': 0.0003393686, 'label': 'tree'},
 {'score': 3.1572064e-05, 'label': 'cat'}]
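
The post-processing step above is just a softmax over the per-image logits followed by a sort. A minimal, dependency-free sketch of that step is shown below. The logit values are made up for illustration and are not real model outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["tree", "car", "bike", "cat"]
logits = [18.0, 25.0, 19.5, 15.0]  # hypothetical image-text similarity logits

probs = softmax(logits)
# Pair each probability with its label and sort from most to least likely.
ranked = sorted(zip(probs, candidate_labels), key=lambda pair: -pair[0])
result = [{"score": score, "label": label} for score, label in ranked]
print(result[0]["label"])
```

Because softmax normalizes across the candidate labels only, the scores always sum to 1 even when none of the labels actually describes the image.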

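Monocular depth estimation

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/monocular_depth_estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a single image. In other words, it is the process of estimating the distance of objects in a scene from a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task because the model has to understand the complex relationships between the objects in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, occlusion, and texture.

The task illustrated in this tutorial is supported by the following model architectures: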

DPT, GLPN

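In this guide you'll learn how to:

Create a depth estimation pipeline
Run depth estimation inference by hand

Before you begin, make sure you have all the necessary libraries installed: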

pip install -q transformers

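Depth estimation pipeline

The simplest way to try out inference with a model that supports depth estimation is to use the corresponding pipeline(). Instantiate a pipeline from a checkpoint on the Hugging Face Hub: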

>>> from transformers import pipeline

>>> checkpoint = "vinvino02/glpn-nyu"
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)

Next, choose an image to analyze:

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image

Photo of a busy street

Pass the image to the pipeline.

>>> predictions = depth_estimator(image)

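The pipeline returns a dictionary with two entries. The first one, called predicted_depth, is a tensor whose values are the depth in meters for each pixel. The second one, depth, is a PIL image that visualizes the depth estimation result.

Let's take a look at the visualized result: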

>>> predictions["depth"]

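Depth estimation visualization

Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how to replicate the same result by hand.

Start by loading the model and the associated processor from a checkpoint on the Hugging Face Hub. Here we'll use the same checkpoint as before: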

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "vinvino02/glpn-nyu"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

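Prepare the image input for the model using the image_processor, which takes care of the necessary image transformations such as resizing and normalization: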

>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

Pass the prepared inputs through the model:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     predicted_depth = outputs.predicted_depth

Visualize the results:

>>> import numpy as np

>>> # interpolate to original size
>>> prediction = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1),
...     size=image.size[::-1],
...     mode="bicubic",
...     align_corners=False,
... ).squeeze()
>>> output = prediction.numpy()

>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)
>>> depth

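Depth estimation visualization

Image-to-image task guide

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/image_to_image

Image-to-image is the task where an application receives an image and outputs another image. It covers various subtasks, including image enhancement (super resolution, low light enhancement, deraining, and so on), image inpainting, and more.

This guide will show you how to:

Use an image-to-image pipeline for the super resolution task,
Run image-to-image models for the same task without a pipeline.

Note that as of the time this guide was released, the image-to-image pipeline only supports the super resolution task.

Let's begin by installing the necessary libraries.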

pip install transformers

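We can now initialize the pipeline with a Swin2SR model and then run inference by calling the pipeline on an image. At the moment, this pipeline only supports Swin2SR models.

import torch
from transformers import pipeline

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device)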

Now, let's load an image.

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)

print(image.size)

# (532, 432)

Photo of a cat

We can now run inference with the pipeline. We will get an upscaled version of the cat image.

upscaled = pipe(image)
print(upscaled.size)

# (1072, 880)

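If you'd rather run inference yourself without the pipeline, you can use the Swin2SRForImageSuperResolution and Swin2SRImageProcessor classes of transformers. We will use the same model checkpoint. Let's initialize the model and the processor.

from transformers import Swin2SRForImageSuperResolution, Swin2SRImageProcessor

model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweight-x2-64").to(device)
# Load the preprocessing configuration that ships with the checkpoint.
processor = Swin2SRImageProcessor.from_pretrained("caidas/swin2SR-lightweight-x2-64")

pipeline abstracts away the preprocessing and postprocessing steps that we now have to do ourselves, so let's preprocess the image. We pass the image to the processor and then move the pixel values to the selected device (the GPU, if one is available).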

pixel_values = processor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)

pixel_values = pixel_values.to(device)

We can now run inference on the image by passing the pixel values to the model.

import torch

with torch.no_grad():
  outputs = model(pixel_values)

The output is an object of type ImageSuperResolutionOutput that looks like the following 👇

(loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275,  ..., 0.7463, 0.7446, 0.7453],
          [0.8287, 0.8278, 0.8283,  ..., 0.7451, 0.7448, 0.7457],
          [0.8280, 0.8273, 0.8269,  ..., 0.7447, 0.7446, 0.7452],
          ...,
          [0.5923, 0.5933, 0.5924,  ..., 0.0697, 0.0695, 0.0706],
          [0.5926, 0.5932, 0.5926,  ..., 0.0673, 0.0687, 0.0705],
          [0.5927, 0.5914, 0.5922,  ..., 0.0664, 0.0694, 0.0718]]]],
       device='cuda:0'), hidden_states=None, attentions=None)

We need to take the reconstruction and post-process it for visualization. Let's see what it looks like.

outputs.reconstruction.data.shape
# torch.Size([1, 3, 880, 1072])

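We need to squeeze the output to get rid of axis 0, clip the values, and convert the result to a NumPy float array. Then we move the channel axis to the end, which gives an array of shape (880, 1072, 3), that is, a 1072x880 image, and finally bring the values back to the range [0, 255].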

import numpy as np

# squeeze, take to CPU and clip the values
output = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy()
# rearrange the axes
output = np.moveaxis(output, source=0, destination=-1)
# bring values back to pixel values range
output = (output * 255.0).round().astype(np.uint8)
Image.fromarray(output)

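Upscaled photo of a cat

Knowledge distillation for computer vision

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/knowledge_distillation_for_image_classification

Knowledge distillation is a technique used to transfer knowledge from a larger, more complex model (the teacher) to a smaller, simpler model (the student). To distill knowledge from one model into another, we take a pre-trained teacher model trained on a certain task (image classification in this case) and randomly initialize a student model to be trained on image classification. Next, we train the student model to minimize the difference between its outputs and the teacher's outputs, so that it mimics the teacher's behavior. The technique was first introduced in "Distilling the Knowledge in a Neural Network" by Hinton et al. In this guide, we will do task-specific knowledge distillation, using the beans dataset.

This guide demonstrates how to distill a fine-tuned ViT model (the teacher) into a MobileNet (the student) using the Trainer API of 🤗 Transformers.

Let's install the libraries needed for distillation and for evaluating the process.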

pip install transformers datasets accelerate tensorboard evaluate --upgrade

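In this example, we use the merve/beans-vit-224 model as the teacher model. It is an image classification model based on google/vit-base-patch16-224-in21k and fine-tuned on the beans dataset. We will distill this model into a randomly initialized MobileNetV2.

We will now load the dataset.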

from datasets import load_dataset

dataset = load_dataset("beans")

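We can use the image processor from either of the models, since in this case they return the same output at the same resolution. We will use the map() method of dataset to apply the preprocessing to every split of the dataset.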

from transformers import AutoImageProcessor
teacher_processor = AutoImageProcessor.from_pretrained("merve/beans-vit-224")

def process(examples):
    processed_inputs = teacher_processor(examples["image"])
    return processed_inputs

processed_datasets = dataset.map(process, batched=True)

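Essentially, we want the student model (a randomly initialized MobileNet) to mimic the teacher model (a fine-tuned vision transformer). To achieve this, we first get the logits output by the teacher and by the student. Then we divide each of them by the parameter temperature, which controls the importance of each soft target. A parameter called lambda weighs the importance of the distillation loss. In this example, we will use temperature=5 and lambda=0.5. We will use the Kullback-Leibler divergence loss to compute the divergence between the student and the teacher. Given two distributions P and Q, the KL divergence tells us how much extra information we need to represent P using Q. If the two are identical, their KL divergence is zero, because no additional information is needed to explain P from Q. This is what makes KL divergence useful in the context of knowledge distillation.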

from transformers import TrainingArguments, Trainer
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageDistilTrainer(Trainer):
    def __init__(self, teacher_model=None, student_model=None, temperature=None, lambda_param=None,  *args, **kwargs):
        super().__init__(model=student_model, *args, **kwargs)
        self.teacher = teacher_model
        self.student = student_model
        self.loss_function = nn.KLDivLoss(reduction="batchmean")
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.teacher.to(device)
        self.teacher.eval()
        self.temperature = temperature
        self.lambda_param = lambda_param

    def compute_loss(self, student, inputs, return_outputs=False):
        student_output = self.student(**inputs)

        with torch.no_grad():
          teacher_output = self.teacher(**inputs)

        # Compute soft targets for teacher and student
        soft_teacher = F.softmax(teacher_output.logits / self.temperature, dim=-1)
        soft_student = F.log_softmax(student_output.logits / self.temperature, dim=-1)

        # Compute the loss
        distillation_loss = self.loss_function(soft_student, soft_teacher) * (self.temperature ** 2)

        # Compute the true label loss
        student_target_loss = student_output.loss

        # Calculate final loss
        loss = (1. - self.lambda_param) * student_target_loss + self.lambda_param * distillation_loss
        return (loss, student_output) if return_outputs else loss
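
To make the loss in compute_loss concrete, here is a small worked example in plain Python for a single sample with three classes. It is only a sketch under stated assumptions: the logits and the true label are made-up values, and the KL term is computed per sample rather than with the batchmean reduction used by nn.KLDivLoss above.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher output
student_logits = [2.5, 1.5, 0.2]  # hypothetical student output
true_label = 0                    # hypothetical ground-truth class
temperature, lambda_param = 5.0, 0.5

# Soft targets: both sets of logits are softened with the same temperature.
soft_teacher = softmax(teacher_logits, temperature)
soft_student = softmax(student_logits, temperature)

# Distillation term, scaled by T^2 as in the trainer above.
distillation_loss = kl_divergence(soft_teacher, soft_student) * temperature**2

# Hard-label term: ordinary cross-entropy at temperature 1.
student_target_loss = -math.log(softmax(student_logits)[true_label])

loss = (1.0 - lambda_param) * student_target_loss + lambda_param * distillation_loss
print(round(loss, 4))
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's relative preferences among the wrong classes, not only its top prediction.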

We will now log in to the Hugging Face Hub so that we can push our model to the Hub through the Trainer.

from huggingface_hub import notebook_login

notebook_login()

Let's set up the TrainingArguments, the teacher model, and the student model.

from transformers import AutoModelForImageClassification, MobileNetV2Config, MobileNetV2ForImageClassification

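# Name of the output directory and Hub repository (placeholder value; replace with your own).
repo_name = "my-awesome-model"

training_args = TrainingArguments(
    output_dir=repo_name,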
    num_train_epochs=30,
    fp16=True,
    logging_dir=f"{repo_name}/logs",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repo_name,
    )

num_labels = len(processed_datasets["train"].features["labels"].names)

# initialize models
teacher_model = AutoModelForImageClassification.from_pretrained(
    "merve/beans-vit-224",
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

# training MobileNetV2 from scratch
student_config = MobileNetV2Config()
student_config.num_labels = num_labels
student_model = MobileNetV2ForImageClassification(student_config)

We can use a compute_metrics function to evaluate our model on the test set. This function will be used during training to compute the accuracy of our model.

import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    acc = accuracy.compute(references=labels, predictions=np.argmax(predictions, axis=1))
    return {"accuracy": acc["accuracy"]}

Let's initialize the Trainer with the training arguments we defined. We'll also initialize our data collator.

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
trainer = ImageDistilTrainer(
    student_model=student_model,
    teacher_model=teacher_model,
    args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    data_collator=data_collator,
    tokenizer=teacher_processor,
    compute_metrics=compute_metrics,
    temperature=5,
    lambda_param=0.5
)

We can now train our model.

trainer.train()

We can evaluate the model on the test set.

trainer.evaluate(processed_datasets["test"])

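On the test set, our model reaches 72 percent accuracy. As a sanity check on the efficiency of distillation, we also trained MobileNet on the beans dataset from scratch with the same hyperparameters and observed 63 percent accuracy on the test set. We invite readers to try different pre-trained teacher models, student architectures, and distillation parameters, and to report their findings. The training logs and checkpoints for the distilled model can be found in this repository, and MobileNetV2 trained from scratch can be found in this repository.

Multimodal

Image captioning

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/image_captioning

Image captioning is the task of predicting a caption for a given image. Common real-world applications include aiding visually impaired people by helping them navigate different situations. Image captioning therefore helps make content more accessible by describing images to people.

This guide will show you how to:

Fine-tune an image captioning model.
Use the fine-tuned model for inference.

Before you begin, make sure you have all the necessary libraries installed: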

pip install transformers datasets evaluate -q
pip install jiwer -q

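We encourage you to log in to your Hugging Face account so you can upload your model and share it with the community. When prompted, enter your token to log in: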

from huggingface_hub import notebook_login

notebook_login()

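Load the Pokemon BLIP captions dataset

Use the 🤗 Datasets library to load a dataset that consists of {image-caption} pairs. To create your own image captioning dataset in PyTorch, you can follow this notebook.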

from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions")
ds

DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 833
    })
})

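The dataset has two features, image and text.

Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample one caption among the available ones during training.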
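
The random-sampling strategy can be sketched in a few lines. The example record below is hypothetical (the Pokemon dataset used in this guide has exactly one caption per image, so the step is not needed here); it only shows the idea of drawing one caption per image each time the example is seen.

```python
import random

# Hypothetical record with several reference captions for one image.
example = {
    "image": "pokemon_001.png",
    "captions": [
        "a drawing of a green pokemon with red eyes",
        "a green cartoon creature",
        "a small green monster with a leaf on its back",
    ],
}


def sample_caption(record, rng):
    """Return an {image, text} pair with one randomly chosen caption."""
    return {"image": record["image"], "text": rng.choice(record["captions"])}


rng = random.Random(0)  # seeded only to make the sketch reproducible
pairs = [sample_caption(example, rng) for _ in range(3)]
for pair in pairs:
    print(pair["text"])
```

Sampling anew at every epoch exposes the model to all reference captions over time without multiplying the dataset size.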

Split the dataset's train split into a train set and a test set with the Dataset.train_test_split method:

ds = ds["train"].train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]

Let's visualize a couple of samples from the training set.

from textwrap import wrap
import matplotlib.pyplot as plt
import numpy as np

def plot_images(images, captions):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        caption = captions[i]
        caption = "\n".join(wrap(caption, 12))
        plt.title(caption)
        plt.imshow(images[i])
        plt.axis("off")

sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)]
sample_captions = [train_ds[i]["text"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)

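Sample training images

Preprocess the dataset

Since the dataset has two modalities (image and text), the preprocessing pipeline will preprocess both the images and the captions.

To do so, load the processor class associated with the model you are about to fine-tune.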

from transformers import AutoProcessor

checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)

The processor will internally preprocess the image (which includes resizing and pixel scaling) and tokenize the caption.

def transforms(example_batch):
    images = [x for x in example_batch["image"]]
    captions = [x for x in example_batch["text"]]
    inputs = processor(images=images, text=captions, padding="max_length")
    inputs.update({"labels": inputs["input_ids"]})
    return inputs

train_ds.set_transform(transforms)
test_ds.set_transform(transforms)

With the dataset ready, you can now set up the model for fine-tuning.

Load a base model

Load "microsoft/git-base" into an AutoModelForCausalLM object.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint)

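Evaluate

Image captioning models are typically evaluated with the Rouge Score or the Word Error Rate. For this guide, you will use the Word Error Rate (WER).

We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to this guide.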

from evaluate import load
import torch

wer = load("wer")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predicted = logits.argmax(-1)
    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels)
    return {"wer_score": wer_score}
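
To see what the WER number means, here is a minimal sketch that computes it for a single prediction and reference using a word-level edit distance. This is an illustration rather than the implementation used by the 🤗 Evaluate metric; the function name and the example strings are assumptions, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j predicted words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "a drawing of a pink and blue pokemon"
prediction = "a drawing of a blue pokemon"
print(word_error_rate(reference, prediction))  # 2 deleted words out of 8 -> 0.25
```

Lower is better: 0 means the prediction matches the reference word for word, and values above 1 are possible when the prediction contains many extra words.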

Train!

Now you are ready to start fine-tuning the model. You will use the 🤗 Trainer for this.

First, define the training arguments using TrainingArguments.

from transformers import TrainingArguments, Trainer

model_name = checkpoint.split("/")[1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-pokemon",
    learning_rate=5e-5,
    num_train_epochs=50,
    fp16=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=50,
    remove_unused_columns=False,
    push_to_hub=True,
    label_names=["labels"],
    load_best_model_at_end=True,
)

Then pass them along with the datasets and the model to the 🤗 Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

To start training, simply call train() on the Trainer object.

trainer.train()

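You should see the training loss drop smoothly as training progresses.

Once training is complete, share your model to the Hub with the push_to_hub() method so everyone can use it: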

trainer.push_to_hub()

Inference

Take a sample image from test_ds to test the model.

from PIL import Image
import requests

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw)
image

Test image

Prepare the image for the model.

device = "cuda" if torch.cuda.is_available() else "cpu"

inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

Call generate and decode the predictions.

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)

a drawing of a pink and blue pokemon

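Looks like the fine-tuned model generated a pretty good caption!

Document question answering

Original link: huggingface.co/docs/transformers/v4.37.2/en/tasks/document_question_answering

Document question answering, also referred to as document visual question answering, is a task that involves providing answers to questions posed about document images. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language. These models use multiple modalities, including text, the positions of words (bounding boxes), and the image itself.

This guide illustrates how to:

Fine-tune LayoutLMv2 on the DocVQA dataset.
Use your fine-tuned model for inference.

The task illustrated in this tutorial is supported by the following model architectures: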

LayoutLM, LayoutLMv2, LayoutLMv3

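LayoutLMv2 solves the document question answering task by adding a question-answering head on top of the final hidden states of the tokens, to predict the positions of the start and end tokens of the answer. In other words, the problem is treated as extractive question answering: given the context, extract the piece of information that answers the question. The context comes from the output of an OCR engine, here Google's Tesseract.

Before you begin, make sure you have all the necessary libraries installed. LayoutLMv2 depends on detectron2, torchvision, and tesseract.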

pip install -q transformers datasets

pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install torchvision

sudo apt install tesseract-ocr
pip install -q pytesseract

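Once you have installed all of the dependencies, restart your runtime.

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. When prompted, enter your token to log in: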

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Let's define some global variables.

>>> model_checkpoint = "microsoft/layoutlmv2-base-uncased"
>>> batch_size = 4

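Load the data

In this guide we use a small sample of preprocessed DocVQA that you can find on the 🤗 Hub. If you'd like to use the full DocVQA dataset, you can register and download it on the DocVQA homepage. If you do so, to proceed with this guide check out how to load files into a 🤗 dataset.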

>>> from datasets import load_dataset

>>> dataset = load_dataset("nielsr/docvqa_1200_examples")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 200
    })
})

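As you can see, the dataset is already split into train and test sets. Take a look at a random example to familiarize yourself with the features.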

>>> dataset["train"].features

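Here's what the individual fields represent:

id: the example's id
image: a PIL.Image.Image object containing the document image
query: the question string, a natural-language question that can be in several languages
answers: a list of correct answers provided by human annotators
words and bounding_boxes: the results of OCR, which we will not use here
answer: an answer matched by a different model, which we will not use here

Let's keep only the English questions and drop the answer feature, which appears to contain predictions by another model. We'll also take the first of the answers from the set provided by the annotators. Alternatively, you can sample one at random.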

>>> updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
>>> updated_dataset = updated_dataset.map(
...     lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
... )

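Note that the LayoutLMv2 checkpoint used in this guide has been trained with max_position_embeddings = 512 (you can find this information in the checkpoint's config.json file). We could truncate the examples, but to avoid the situation where the answer sits at the end of a large document and ends up truncated, here we remove the few examples whose embedding is likely to be longer than 512. If most of the documents in your dataset are long, you can implement a sliding window strategy; check out this notebook for details.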

>>> updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)
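
The sliding window strategy mentioned above is not used in this guide, but the idea can be sketched generically: split a long token sequence into overlapping chunks so that an answer near the end still appears whole in at least one chunk. The function name and parameters below are illustrative assumptions, not the API of the referenced notebook or of the tokenizer.

```python
def sliding_windows(tokens, max_length, step):
    """Split tokens into chunks of at most max_length, advancing by step each time."""
    if not 0 < step <= max_length:
        raise ValueError("step must be between 1 and max_length")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start : start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(tokens):
            break
        start += step
    return windows


tokens = list(range(10))  # stand-in for token ids of a long document
windows = sliding_windows(tokens, max_length=4, step=2)
print(windows)
```

Here consecutive windows overlap by max_length - step tokens; a larger overlap lowers the chance of cutting an answer in half at the cost of more chunks per document.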

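At this point let's also remove the OCR features from the dataset. They are the result of OCR run for fine-tuning a different model. They would still need some processing if we wanted to use them, because they do not match the input requirements of the model we use in this guide. Instead, we can use the LayoutLMv2Processor on the original data for both OCR and tokenization. That way we get inputs that match what the model expects. If you want to process the images manually, check out the LayoutLMv2 model documentation to learn what input format the model expects.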

>>> updated_dataset = updated_dataset.remove_columns("words")
>>> updated_dataset = updated_dataset.remove_columns("bounding_boxes")

Finally, the data exploration wouldn't be complete without a look at an image example.

>>> updated_dataset["train"][11]["image"]

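DocVQA image example

Preprocess the data

Document question answering is a multimodal task, and you need to make sure that the inputs from each modality are preprocessed according to the model's expectations. Let's start by loading the LayoutLMv2Processor, which internally combines an image processor that can handle image data and a tokenizer that can encode text data.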

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained(model_checkpoint)

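Preprocessing document images

First, let's prepare the document images for the model with the help of the image_processor from the processor. By default, the image processor resizes the images to 224x224, makes sure they have the correct order of color channels, and applies OCR with tesseract to get the words and normalized bounding boxes. In this tutorial, all of these defaults are exactly what we need. Write a function that applies the default image processing to a batch of images and returns the results of OCR.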

>>> image_processor = processor.image_processor

>>> def get_ocr_words_and_boxes(examples):
...     images = [image.convert("RGB") for image in examples["image"]]
...     encoded_inputs = image_processor(images)

...     examples["image"] = encoded_inputs.pixel_values
...     examples["words"] = encoded_inputs.words
...     examples["boxes"] = encoded_inputs.boxes

...     return examples

To apply this preprocessing to the entire dataset quickly, use map.

>>> dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

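Preprocessing text data

Once we have applied OCR to the images, we need to encode the text part of the dataset to prepare it for the model. This involves converting the words and boxes obtained in the previous step into token-level input_ids, attention_mask, token_type_ids, and bbox. For preprocessing text, we'll need the tokenizer from the processor.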

>>> tokenizer = processor.tokenizer

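On top of the preprocessing mentioned above, we also need to add the labels for the model. For xxxForQuestionAnswering models in 🤗 Transformers, the labels consist of start_positions and end_positions, which indicate the tokens at which the answer starts and ends.

Let's start with that. Define a helper function that can find a sublist (the answer split into words) in a larger list (the words list).

This function takes two lists as input, words_list and answer_list. It iterates over words_list and checks whether the current word (words_list[i]) equals the first word of answer_list (answer_list[0]) and whether the sublist of words_list that starts at the current word and has the same length as answer_list equals answer_list. If this condition is true, a match has been found, and the function records the match together with its starting index (idx) and ending index (idx + len(answer_list) - 1). If more than one match is found, the function returns only the first one. If no match is found, the function returns (None, 0, 0).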

>>> def subfinder(words_list, answer_list):
...     matches = []
...     start_indices = []
...     end_indices = []
...     for idx, i in enumerate(range(len(words_list))):
...         if words_list[i] == answer_list[0] and words_list[i : i + len(answer_list)] == answer_list:
...             matches.append(answer_list)
...             start_indices.append(idx)
...             end_indices.append(idx + len(answer_list) - 1)
...     if matches:
...         return matches[0], start_indices[0], end_indices[0]
...     else:
...         return None, 0, 0
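
As a quick offline sanity check of the matching logic, the sketch below runs a compact restatement of the same search on a tiny hand-written word list. The name find_sublist and the sample words are hypothetical, and the answer is assumed to be non-empty.

```python
def find_sublist(words, answer):
    """Return (match, start_index, end_index) for the first occurrence of answer in words."""
    length = len(answer)
    for start in range(len(words) - length + 1):
        if words[start : start + length] == answer:
            return answer, start, start + length - 1
    # Same "not found" convention as the helper above.
    return None, 0, 0


words = ["to:", "r.", "h.", "honeycutt", "cc:", "t.f.", "riehl"]

found = find_sublist(words, "T.F. Riehl".lower().split())
missing = find_sublist(words, ["c.j.", "cook"])
print(found)    # (['t.f.', 'riehl'], 5, 6)
print(missing)  # (None, 0, 0)
```

Note that the comparison is exact on lowercased words, so OCR errors or punctuation attached to a word (for example "riehl," instead of "riehl") will prevent a match.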

To illustrate how this function finds the position of the answer, let's use it on an example:

>>> example = dataset_with_ocr["train"][1]
>>> words = [word.lower() for word in example["words"]]
>>> match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split())
>>> print("Question: ", example["question"])
>>> print("Words:", words)
>>> print("Answer: ", example["answer"])
>>> print("start_index", word_idx_start)
>>> print("end_index", word_idx_end)
Question:  Who is in  cc in this letter?
Words: ['wie', 'baw', 'brown', '&', 'williamson', 'tobacco', 'corporation', 'research', '&', 'development', 'internal', 'correspondence', 'to:', 'r.', 'h.', 'honeycutt', 'ce:', 't.f.', 'riehl', 'from:', '.', 'c.j.', 'cook', 'date:', 'may', '8,', '1995', 'subject:', 'review', 'of', 'existing', 'brainstorming', 'ideas/483', 'the', 'major', 'function', 'of', 'the', 'product', 'innovation', 'graup', 'is', 'to', 'develop', 'marketable', 'nove!', 'products', 'that', 'would', 'be', 'profitable', 'to', 'manufacture', 'and', 'sell.', 'novel', 'is', 'defined', 'as:', 'of', 'a', 'new', 'kind,', 'or', 'different', 'from', 'anything', 'seen', 'or', 'known', 'before.', 'innovation', 'is', 'defined', 'as:', 'something', 'new', 'or', 'different', 'introduced;', 'act', 'of', 'innovating;', 'introduction', 'of', 'new', 'things', 'or', 'methods.', 'the', 'products', 'may', 'incorporate', 'the', 'latest', 'technologies,', 'materials', 'and', 'know-how', 'available', 'to', 'give', 'then', 'a', 'unique', 'taste', 'or', 'look.', 'the', 'first', 'task', 'of', 'the', 'product', 'innovation', 'group', 'was', 'to', 'assemble,', 'review', 'and', 'categorize', 'a', 'list', 'of', 'existing', 'brainstorming', 'ideas.', 'ideas', 'were', 'grouped', 'into', 'two', 'major', 'categories', 'labeled', 'appearance', 'and', 'taste/aroma.', 'these', 'categories', 'are', 'used', 'for', 'novel', 'products', 'that', 'may', 'differ', 'from', 'a', 'visual', 'and/or', 'taste/aroma', 'point', 'of', 'view', 'compared', 'to', 'canventional', 'cigarettes.', 'other', 'categories', 'include', 'a', 'combination', 'of', 'the', 'above,', 'filters,', 'packaging', 'and', 'brand', 'extensions.', 'appearance', 'this', 'category', 'is', 'used', 'for', 'novel', 'cigarette', 'constructions', 'that', 'yield', 'visually', 'different', 'products', 'with', 'minimal', 'changes', 'in', 'smoke', 'chemistry', 'two', 'cigarettes', 'in', 'cne.', 'emulti-plug', 'te', 'build', 'yaur', 'awn', 'cigarette.', 'eswitchable', 'menthol', 'or', 
'non', 'menthol', 'cigarette.', '*cigarettes', 'with', 'interspaced', 'perforations', 'to', 'enable', 'smoker', 'to', 'separate', 'unburned', 'section', 'for', 'future', 'smoking.', '«short', 'cigarette,', 'tobacco', 'section', '30', 'mm.', '«extremely', 'fast', 'buming', 'cigarette.', '«novel', 'cigarette', 'constructions', 'that', 'permit', 'a', 'significant', 'reduction', 'iretobacco', 'weight', 'while', 'maintaining', 'smoking', 'mechanics', 'and', 'visual', 'characteristics.', 'higher', 'basis', 'weight', 'paper:', 'potential', 'reduction', 'in', 'tobacco', 'weight.', '«more', 'rigid', 'tobacco', 'column;', 'stiffing', 'agent', 'for', 'tobacco;', 'e.g.', 'starch', '*colored', 'tow', 'and', 'cigarette', 'papers;', 'seasonal', 'promotions,', 'e.g.', 'pastel', 'colored', 'cigarettes', 'for', 'easter', 'or', 'in', 'an', 'ebony', 'and', 'ivory', 'brand', 'containing', 'a', 'mixture', 'of', 'all', 'black', '(black', 'paper', 'and', 'tow)', 'and', 'ail', 'white', 'cigarettes.', '499150498']
Answer:  T.F. Riehl
start_index 17
end_index 18

Once the examples are encoded, however, they will look like this:

>>> encoding = tokenizer(example["question"], example["words"], example["boxes"])
>>> tokenizer.decode(encoding["input_ids"])
[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development ...

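We need to find the position of the answer in the encoded input.

token_type_ids tells us which tokens are part of the question and which are part of the document's words.
tokenizer.cls_token_id helps find the special token at the beginning of the input.
word_ids helps match the answer found in the original words to the same answer in the fully encoded input, and determine the start/end positions of the answer in the encoded input.

With that in mind, let's create a function to encode a batch of examples in the dataset: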

>>> def encode_dataset(examples, max_length=512):
...     questions = examples["question"]
...     words = examples["words"]
...     boxes = examples["boxes"]
...     answers = examples["answer"]

...     # encode the batch of examples and initialize the start_positions and end_positions
...     encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True)
...     start_positions = []
...     end_positions = []

...     # loop through the examples in the batch
...     for i in range(len(questions)):
...         cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id)

...         # find the position of the answer in example's words
...         words_example = [word.lower() for word in words[i]]
...         answer = answers[i]
...         match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())

...         if match:
...             # if match is found, use `token_type_ids` to find where words start in the encoding
...             token_type_ids = encoding["token_type_ids"][i]
...             token_start_index = 0
...             while token_type_ids[token_start_index] != 1:
...                 token_start_index += 1

...             token_end_index = len(encoding["input_ids"][i]) - 1
...             while token_type_ids[token_end_index] != 1:
...                 token_end_index -= 1

...             word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1]
...             start_position = cls_index
...             end_position = cls_index

...             # loop over word_ids and increase `token_start_index` until it matches the answer position in words
...             # once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding
...             for id in word_ids:
...                 if id == word_idx_start:
...                     start_position = token_start_index
...                 else:
...                     token_start_index += 1

...             # similarly loop over `word_ids` starting from the end to find the `end_position` of the answer
...             for id in word_ids[::-1]:
...                 if id == word_idx_end:
...                     end_position = token_end_index
...                 else:
...                     token_end_index -= 1

...             start_positions.append(start_position)
...             end_positions.append(end_position)

...         else:
...             start_positions.append(cls_index)
...             end_positions.append(cls_index)

...     encoding["image"] = examples["image"]
...     encoding["start_positions"] = start_positions
...     encoding["end_positions"] = end_positions

...     return encoding
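
The core of encode_dataset is the mapping from word indices to token indices. The sketch below restates that step on a toy encoding so it can be followed by eye. The token_type_ids and word_ids values are hand-written assumptions that mimic what a tokenizer would return for a short question followed by three document words; the helper name is hypothetical.

```python
def word_span_to_token_span(word_ids, token_type_ids, word_start, word_end):
    """Map a (word_start, word_end) span in the document words to token positions."""
    # Only tokens with token_type_id == 1 belong to the document words.
    context = [i for i, type_id in enumerate(token_type_ids) if type_id == 1]
    start = next((i for i in context if word_ids[i] == word_start), None)
    end = next((i for i in reversed(context) if word_ids[i] == word_end), None)
    if start is None or end is None:
        return None  # the answer was truncated away or never matched
    return start, end


# Toy encoding: [CLS] who cc [SEP] to : t .f. rie ##hl [SEP]
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
word_ids = [None, 0, 1, None, 0, 0, 1, 1, 2, 2, None]

# The answer covers document words 1 and 2 ("t.f." and "riehl").
span = word_span_to_token_span(word_ids, token_type_ids, word_start=1, word_end=2)
print(span)  # (6, 9)
```

The question also has words numbered 0 and 1, which is why the search is restricted to tokens whose token_type_id is 1; in encode_dataset, unmatched answers fall back to the position of the [CLS] token instead of None.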

Now that we have this preprocessing function, we can encode the entire dataset:

>>> encoded_train_dataset = dataset_with_ocr["train"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names
... )
>>> encoded_test_dataset = dataset_with_ocr["test"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names
... )

Let's check what the features of the encoded dataset look like:

>>> encoded_train_dataset.features
{'image': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='uint8', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'bbox': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'start_positions': Value(dtype='int64', id=None),
 'end_positions': Value(dtype='int64', id=None)}

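Evaluation

Evaluation for document question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance. Extractive question answering is typically evaluated using F1/exact match. If you'd like to implement it yourself, check out the question answering chapter of the Hugging Face course for inspiration.

Train

Congratulations! You've made it through the toughest part of this guide, and you are now ready to train your own model. Training involves the following steps:

Load the model with AutoModelForDocumentQuestionAnswering using the same checkpoint as in the preprocessing.
Define your training hyperparameters in TrainingArguments.
Define a function to batch examples together; here the DefaultDataCollator will do just fine.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to fine-tune your model.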

>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint)

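In TrainingArguments, use output_dir to specify where to save your model, and configure the hyperparameters as you see fit. If you wish to share your model with the community, set push_to_hub to True (you must be signed in to Hugging Face to upload your model). In this case output_dir will also be the name of the repository where your model checkpoint will be pushed.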

>>> from transformers import TrainingArguments

>>> # REPLACE THIS WITH YOUR REPO ID
>>> repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     evaluation_strategy="steps",
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

Define a simple data collator to batch examples together.

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

Finally, bring everything together and call train():

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=encoded_train_dataset,
...     eval_dataset=encoded_test_dataset,
...     tokenizer=processor,
... )

>>> trainer.train()

To add the final model to the 🤗 Hub, create a model card and call push_to_hub:

>>> trainer.create_model_card()
>>> trainer.push_to_hub()

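Inference

Now that you have fine-tuned a LayoutLMv2 model and uploaded it to the 🤗 Hub, you can use it for inference. The simplest way to try out your fine-tuned model for inference is to use it in a Pipeline.

Let's take an example: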

>>> example = dataset["test"][2]
>>> question = example["query"]["en"]
>>> image = example["image"]
>>> print(question)
>>> print(example["answers"])
'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?'
['TRRF Vice President', 'lee a. waller']

Next, instantiate a pipeline for document question answering with your model, and pass the image and question combination to it.

>>> from transformers import pipeline

>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> qa_pipeline(image, question)
[{'score': 0.9949808120727539,
  'answer': 'Lee A. Waller',
  'start': 55,
  'end': 57}]

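You can also manually replicate the results of the pipeline if you'd like:

Take an image and a question, and prepare them for the model using the processor from your model.
Forward the preprocessed result through the model.
The model returns start_logits and end_logits, which indicate which token is at the start of the answer and which token is at the end of the answer. Both have shape (batch_size, sequence_length).
Take an argmax over the last dimension of both start_logits and end_logits to get the predicted start_idx and end_idx.
Decode the answer with the tokenizer.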

>>> import torch
>>> from transformers import AutoProcessor
>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")

>>> with torch.no_grad():
...     encoding = processor(image.convert("RGB"), question, return_tensors="pt")
...     outputs = model(**encoding)
...     start_logits = outputs.start_logits
...     end_logits = outputs.end_logits
...     predicted_start_idx = start_logits.argmax(-1).item()
...     predicted_end_idx = end_logits.argmax(-1).item()

>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1])
'lee a. waller'

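Visual Question Answering

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/visual_question_answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. The input to models supporting this task is typically a combination of an image and a question, and the output is an answer expressed in natural language.

Some noteworthy use cases of VQA include:

Accessibility applications for visually impaired individuals.
Education: posing questions about visual materials presented in lectures or textbooks. VQA can also be used in interactive museum exhibits or at historical sites.
Customer service and e-commerce: VQA can enhance the user experience by letting users ask questions about products.
Image retrieval: VQA models can be used to retrieve images with specific characteristics. For example, a user can ask "Is there a dog?" to find all images with dogs in a set of images.

In this guide you'll learn how to:

Fine-tune a classification VQA model, specifically ViLT, on the Graphcore/vqa dataset.
Use your fine-tuned ViLT for inference.
Run zero-shot VQA inference with a generative model, such as BLIP-2.

Fine-tuning ViLT

The ViLT model incorporates text embeddings into a Vision Transformer (ViT), giving it a minimal design for Vision-and-Language Pre-training (VLP). The model can be used for several downstream tasks. For the VQA task, a classifier head is placed on top (a linear layer on top of the final hidden state of the [CLS] token) and randomly initialized. Visual question answering is thus treated as a classification problem.

More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. Later in this guide, we illustrate how to use them for zero-shot VQA inference.

Before you begin, make sure you have all the necessary libraries installed.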

pip install -q transformers datasets

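We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. When prompted, enter your token to log in: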

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Let's define the model checkpoint as a global variable.

>>> model_checkpoint = "dandelin/vilt-b32-mlm"

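Load the data

For illustration purposes, this guide uses a very small sample of the annotated visual question answering Graphcore/vqa dataset. You can find the full dataset on the 🤗 Hub.

As an alternative to the Graphcore/vqa dataset, you can download the same data manually from the official VQA dataset page. If you prefer to follow the tutorial with your own custom data, check out the Create an image dataset guide in the 🤗 Datasets documentation.

Let's load the first 200 examples from the validation split and explore the dataset's features: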

>>> from datasets import load_dataset

>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]")
>>> dataset
Dataset({
    features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'],
    num_rows: 200
})

Let's take a look at an example to understand the dataset's features:

>>> dataset[0]
{'question': 'Where is he looking?',
 'question_type': 'none of the above',
 'question_id': 262148000,
 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg',
 'answer_type': 'other',
 'label': {'ids': ['at table', 'down', 'skateboard', 'table'],
  'weights': [0.30000001192092896,
   1.0,
   0.30000001192092896,
   0.30000001192092896]}}

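The features relevant to the task include:

question: the question to be answered from the image
image_id: the path to the image the question refers to
label: the annotations

We can remove the rest of the features, as they won't be needed: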

>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type'])

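As you can see, the label feature contains several answers to the same question (called ids here) collected by different human annotators. This is because the answer to a question can be subjective. In this case, the question is "Where is he looking?". Some people annotated this with "down", others with "at table", another one with "skateboard", and so on.

Take a look at the image and consider which answer you would give: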

>>> from PIL import Image

>>> image = Image.open(dataset[0]['image_id'])
>>> image

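VQA image example

Because of the ambiguity of the questions and answers, datasets like this are treated as a multi-label classification problem (as multiple answers may be valid). Moreover, rather than creating a one-hot encoded vector, one creates a soft encoding based on the number of times a certain answer appeared in the annotations.

For instance, in the example above, because the answer "down" is selected far more often than the other answers, it has a score (called weight in the dataset) of 1.0, while the rest of the answers have scores below 1.0.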
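To make the idea of a soft encoding concrete, the sketch below derives weights from annotator counts. The mapping used here (0.3 per annotator, saturating at 1.0 from four annotators onward) is an assumption based on a convention often seen in VQA preprocessing; the guide does not state how the weights in Graphcore/vqa were computed, and the raw annotation list is made up.

```python
from collections import Counter


def soft_score(count):
    """Map the number of annotators who gave an answer to a weight in [0, 1]."""
    # Assumed mapping: 0.3 per annotator, saturating at 1.0 from four onward
    return 1.0 if count >= 4 else round(0.3 * count, 1)


def soft_labels(annotator_answers):
    """Turn a list of raw annotator answers into {answer: weight}."""
    counts = Counter(annotator_answers)
    return {answer: soft_score(n) for answer, n in counts.items()}


# Hypothetical raw annotations for "Where is he looking?"
answers = ["down"] * 7 + ["at table", "skateboard", "table"]
print(soft_labels(answers))
# {'down': 1.0, 'at table': 0.3, 'skateboard': 0.3, 'table': 0.3}
```

Under this assumed mapping, the result matches the weights shown for the first dataset example: the dominant answer gets 1.0 and each answer given once gets 0.3.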

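To later instantiate the model with an appropriate classification head, let's create two dictionaries: one that maps the label name to an integer, and another that maps the integer back to the label name: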

>>> import itertools

>>> labels = [item['ids'] for item in dataset['label']]
>>> flattened_labels = list(itertools.chain(*labels))
>>> unique_labels = list(set(flattened_labels))

>>> label2id = {label: idx for idx, label in enumerate(unique_labels)}
>>> id2label = {idx: label for label, idx in label2id.items()}

Now that we have the mappings, we can replace the string answers with their ids, and flatten the dataset for more convenient further preprocessing.

>>> def replace_ids(inputs):
...   inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]]
...   return inputs

>>> dataset = dataset.map(replace_ids)
>>> flat_dataset = dataset.flatten()
>>> flat_dataset.features
{'question': Value(dtype='string', id=None),
 'image_id': Value(dtype='string', id=None),
 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)}

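Preprocessing data

The next step is to load a ViLT processor to prepare the image and text data for the model. ViltProcessor wraps a BERT tokenizer and a ViLT image processor into a single convenient processor: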

>>> from transformers import ViltProcessor

>>> processor = ViltProcessor.from_pretrained(model_checkpoint)

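To preprocess the data, we need to encode the images and questions using the ViltProcessor. The processor uses BertTokenizerFast to tokenize the text and create input_ids, attention_mask, and token_type_ids for the text data. As for the images, the processor uses ViltImageProcessor to resize and normalize the image and create pixel_values and pixel_mask.

All these preprocessing steps are done under the hood; we only need to call the processor. However, we still need to prepare the target labels. The target is a vector with one element per possible answer (label). For correct answers, the element holds the corresponding score (weight), while the remaining elements are set to zero.

The following function applies the processor to the images and questions and formats the labels as described above: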

>>> import torch

>>> def preprocess_data(examples):
...     image_paths = examples['image_id']
...     images = [Image.open(image_path) for image_path in image_paths]
...     texts = examples['question']    

...     encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt")

...     for k, v in encoding.items():
...           encoding[k] = v.squeeze()

...     targets = []

...     for labels, scores in zip(examples['label.ids'], examples['label.weights']):
...         target = torch.zeros(len(id2label))

...         for label, score in zip(labels, scores):
...             target[label] = score

...         targets.append(target)

...     encoding["labels"] = targets

...     return encoding

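To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. At this point, feel free to remove the columns you don't need.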

>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question', 'image_id', 'label.ids', 'label.weights'])
>>> processed_dataset
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'],
    num_rows: 200
})

As a final step, create a batch of examples using DefaultDataCollator:

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

Train the model

You're ready to start training your model now! Load ViLT with ViltForQuestionAnswering. Specify the number of labels along with the label mappings:

>>> from transformers import ViltForQuestionAnswering

>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments:

>>> from transformers import TrainingArguments

>>> repo_id = "MariaK/vilt_finetuned_200"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

Pass the training arguments to Trainer along with the model, dataset, processor, and data collator.

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=processed_dataset,
...     tokenizer=processor,
... )

Call train() to fine-tune your model.

>>> trainer.train()

Once training is completed, share your model to the 🤗 Hub with the push_to_hub() method:

>>> trainer.push_to_hub()

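Inference

Now that you have fine-tuned a ViLT model and uploaded it to the 🤗 Hub, you can use it for inference. The simplest way to try out your fine-tuned model for inference is to use it in a Pipeline.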

>>> from transformers import pipeline

>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200")

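The model in this guide has only been trained on 200 examples, so don't expect much from it. Let's see whether it has at least learned something from the data, and take the first example from the dataset to illustrate inference: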

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']
>>> print(question)
>>> pipe(image, question, top_k=1)
"Where is he looking?"
[{'score': 0.5498199462890625, 'answer': 'down'}]

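Even though it is not very confident, the model has indeed learned something. With more examples and longer training, you'll get far better results!

You can also manually replicate the results of the pipeline if you'd like:

Take an image and a question, and prepare them for the model using the processor from your model.
Forward the preprocessed result through the model.
From the logits, get the id of the most likely answer, and find the actual answer in id2label.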

>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200")

>>> image = Image.open(example['image_id'])
>>> question = example['question']

>>> # prepare inputs
>>> inputs = processor(image, question, return_tensors="pt")

>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200")

>>> # forward pass
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: down

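Zero-shot VQA

The previous model treated VQA as a classification task. Some recent models, such as BLIP, BLIP-2, and InstructBLIP, approach VQA as a generative task. Let's take BLIP-2 as an example. It introduced a new vision-language pre-training paradigm in which any combination of a pre-trained vision encoder and LLM can be used (learn more in the BLIP-2 blog post). This enables state-of-the-art results on multiple vision-language tasks, including visual question answering.

Let's illustrate how you can use this model for VQA. First, let's load the model. Here we explicitly send the model to a GPU, if available, which we didn't need to do earlier when training, as Trainer handles this automatically: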

>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration
>>> import torch

>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> model.to(device)

The model takes an image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset:

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']

To use BLIP-2 for the visual question answering task, the textual prompt has to follow a specific format: Question: {} Answer:.

>>> prompt = f"Question: {question} Answer:"

Now we need to preprocess the image/prompt with the model's processor, pass the processed input through the model, and decode the output:

>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

>>> generated_ids = model.generate(**inputs, max_new_tokens=10)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
>>> print(generated_text)
"He is looking at the crowd"

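As you can see, the model recognized the crowd and the direction of the face (looking down); however, it seems to miss the fact that the crowd is behind the skater. Still, in cases where acquiring human-annotated datasets is not feasible, this approach can quickly produce useful results.

Text to speech

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/text-to-speech

Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as Bark, MMS, VITS, and SpeechT5.

You can easily generate audio using the "text-to-audio" pipeline (or its alias, "text-to-speech"). Some models, like Bark, can also be conditioned to generate non-verbal communications such as laughing, sighing, and crying, or even add music. Here's an example of how you would use the "text-to-speech" pipeline with Bark: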

>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="suno/bark-small")
>>> text = "[clears throat] This is a test ... and I just took a long pause."
>>> output = pipe(text)

Here's a code snippet you can use to listen to the resulting audio in a notebook:

>>> from IPython.display import Audio
>>> Audio(output["audio"], rate=output["sampling_rate"])

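For more examples of what Bark and other pretrained TTS models can do, refer to our Audio course.

If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers are SpeechT5 and FastSpeech2Conformer, though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings.

The remainder of this guide illustrates how to:

Fine-tune SpeechT5, which was originally trained on English speech, on the Dutch (nl) language subset of the VoxPopuli dataset.
Use your fine-tuned model for inference in one of two ways: with a pipeline or directly.

Before you begin, make sure you have all the necessary libraries installed: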

pip install datasets soundfile speechbrain accelerate

Install 🤗 Transformers from source, as not all the SpeechT5 features have been merged into an official release yet:

pip install git+https://github.com/huggingface/transformers.git

To follow this guide you will need a GPU. If you're working in a notebook, run the following line to check whether a GPU is available:

!nvidia-smi

or alternatively for AMD GPUs:

!rocm-smi

We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

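Load the dataset

VoxPopuli is a large-scale multilingual speech corpus consisting of data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 European languages. This guide uses the Dutch language subset; feel free to pick another subset.

Note that VoxPopuli or any other automated speech recognition (ASR) dataset may not be the most suitable option for training TTS models. Features that are beneficial for ASR, such as excessive background noise, are typically undesirable in TTS. However, finding top-quality, multilingual, and multi-speaker TTS datasets can be quite challenging.

Let's load the data: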

>>> from datasets import load_dataset, Audio

>>> dataset = load_dataset("facebook/voxpopuli", "nl", split="train")
>>> len(dataset)
20968

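20968 examples should be sufficient for fine-tuning. SpeechT5 expects audio data to have a sampling rate of 16 kHz, so make sure the examples in the dataset meet this requirement:

>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Preprocess the data

Let's begin by defining the model checkpoint to use and loading the appropriate processor: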

>>> from transformers import SpeechT5Processor

>>> checkpoint = "microsoft/speecht5_tts"
>>> processor = SpeechT5Processor.from_pretrained(checkpoint)

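Text cleanup for SpeechT5 tokenization

Start by cleaning up the text data. You'll need the tokenizer part of the processor to process the text: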

>>> tokenizer = processor.tokenizer

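The dataset examples contain raw_text and normalized_text features. When deciding which feature to use as the text input, consider that the SpeechT5 tokenizer doesn't have any tokens for numbers. In normalized_text the numbers are written out as text, so it is the better fit, and we recommend using normalized_text as input text.

Because SpeechT5 was trained on the English language, it may not recognize certain characters in the Dutch dataset. If left as is, these characters will be converted to <unk> tokens. However, in Dutch, certain characters like à are used to stress syllables. In order to preserve the meaning of the text, we can replace this character with a regular a.

To identify unsupported tokens, extract all unique characters in the dataset using the SpeechT5Tokenizer, which works with characters as tokens. To do this, write the extract_all_chars mapping function that concatenates the transcriptions from all examples into one string and converts it to a set of characters. Make sure to set batched=True and batch_size=-1 in dataset.map() so that all transcriptions are available at once for the mapping function.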

>>> def extract_all_chars(batch):
...     all_text = " ".join(batch["normalized_text"])
...     vocab = list(set(all_text))
...     return {"vocab": [vocab], "all_text": [all_text]}

>>> vocabs = dataset.map(
...     extract_all_chars,
...     batched=True,
...     batch_size=-1,
...     keep_in_memory=True,
...     remove_columns=dataset.column_names,
... )

>>> dataset_vocab = set(vocabs["vocab"][0])
>>> tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()}

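Now you have two sets of characters: one with the vocabulary from the dataset and one with the vocabulary from the tokenizer. To identify any unsupported characters in the dataset, you can take the difference between these two sets. The resulting set will contain the characters that are in the dataset but not in the tokenizer.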

>>> dataset_vocab - tokenizer_vocab
{' ', 'à', 'ç', 'è', 'ë', 'í', 'ï', 'ö', 'ü'}

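To handle the unsupported characters identified in the previous step, define a function that maps these characters to valid tokens. Note that spaces are already replaced by ▁ in the tokenizer and don't need to be handled separately.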

>>> replacements = [
...     ("à", "a"),
...     ("ç", "c"),
...     ("è", "e"),
...     ("ë", "e"),
...     ("í", "i"),
...     ("ï", "i"),
...     ("ö", "o"),
...     ("ü", "u"),
... ]

>>> def cleanup_text(inputs):
...     for src, dst in replacements:
...         inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst)
...     return inputs

>>> dataset = dataset.map(cleanup_text)
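The loop above calls replace() once per character pair. As an optional alternative, the same mapping can be applied in a single pass with a translation table. The sketch below uses only the Python standard library; `cleanup_text_fast` is a hypothetical helper name, and the sample sentence is made up.

```python
# Same character pairs as the `replacements` list in the guide
replacements = [
    ("à", "a"), ("ç", "c"), ("è", "e"), ("ë", "e"),
    ("í", "i"), ("ï", "i"), ("ö", "o"), ("ü", "u"),
]

# str.maketrans builds a lookup table from a {source: target} dict
translation_table = str.maketrans(dict(replacements))


def cleanup_text_fast(text):
    """Replace every unsupported character in a single pass over the string."""
    return text.translate(translation_table)


print(cleanup_text_fast("coördinatie à la carte"))  # coordinatie a la carte
```

This only works for one-to-one character replacements on precomposed characters, which is the case for the eight pairs used here.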

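Now that you have dealt with special characters in the text, it's time to shift the focus to the audio data.

Speakers

The VoxPopuli dataset includes speech from multiple speakers, but how many speakers are represented in the dataset? To determine this, we can count the number of unique speakers and the number of examples each speaker contributes to the dataset. With a total of 20,968 examples in the dataset, this information will give us a better understanding of the distribution of speakers and examples in the data.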

>>> from collections import defaultdict

>>> speaker_counts = defaultdict(int)

>>> for speaker_id in dataset["speaker_id"]:
...     speaker_counts[speaker_id] += 1

By plotting a histogram you can get a sense of how much data there is for each speaker.

>>> import matplotlib.pyplot as plt

>>> plt.figure()
>>> plt.hist(speaker_counts.values(), bins=20)
>>> plt.ylabel("Speakers")
>>> plt.xlabel("Examples")
>>> plt.show()

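Speakers histogram

The histogram reveals that approximately one-third of the speakers in the dataset have fewer than 100 examples, while around ten speakers have more than 500 examples. To improve training efficiency and balance the dataset, we can limit the data to speakers with between 100 and 400 examples.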

>>> def select_speaker(speaker_id):
...     return 100 <= speaker_counts[speaker_id] <= 400

>>> dataset = dataset.filter(select_speaker, input_columns=["speaker_id"])

Let's check how many speakers remain:

>>> len(set(dataset["speaker_id"]))
42

Let's see how many examples are left:

>>> len(dataset)
9973

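You are left with just under 10,000 examples from approximately 40 unique speakers, which should be sufficient.

Note that some speakers with few examples may actually have more audio available if the examples are long. However, determining the total amount of audio for each speaker requires scanning through the entire dataset, which is a time-consuming process that involves loading and decoding each audio file. As such, we have chosen to skip this step here.

Speaker embeddings

To enable the TTS model to differentiate between multiple speakers, you'll need to create a speaker embedding for each example. The speaker embedding is an additional input into the model that captures a particular speaker's voice characteristics. To generate these speaker embeddings, use the pre-trained spkrec-xvect-voxceleb model from SpeechBrain.

Create a function create_speaker_embedding() that takes an input audio waveform and outputs a 512-element vector containing the corresponding speaker embedding.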

>>> import os
>>> import torch
>>> from speechbrain.pretrained import EncoderClassifier

>>> spk_model_name = "speechbrain/spkrec-xvect-voxceleb"

>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> speaker_model = EncoderClassifier.from_hparams(
...     source=spk_model_name,
...     run_opts={"device": device},
...     savedir=os.path.join("/tmp", spk_model_name),
... )

>>> def create_speaker_embedding(waveform):
...     with torch.no_grad():
...         speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform))
...         speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2)
...         speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy()
...     return speaker_embeddings

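It's important to note that the speechbrain/spkrec-xvect-voxceleb model was trained on English speech from the VoxCeleb dataset, whereas the training examples in this guide are in Dutch. While we believe that this model will still generate reasonable speaker embeddings for our Dutch dataset, this assumption may not hold true in all cases.

For optimal results, we recommend training an X-vector model on the target speech first. This will ensure that the model is better able to capture the unique voice characteristics present in the Dutch language.

Processing the dataset

Finally, let's process the data into the format the model expects. Create a prepare_dataset function that takes in a single example and uses the SpeechT5Processor object to tokenize the input text and load the target audio into a log-mel spectrogram. It should also add the speaker embeddings as an additional input.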

>>> def prepare_dataset(example):
...     audio = example["audio"]

...     example = processor(
...         text=example["normalized_text"],
...         audio_target=audio["array"],
...         sampling_rate=audio["sampling_rate"],
...         return_attention_mask=False,
...     )

...     # strip off the batch dimension
...     example["labels"] = example["labels"][0]

...     # use SpeechBrain to obtain x-vector
...     example["speaker_embeddings"] = create_speaker_embedding(audio["array"])

...     return example

Verify the processing is correct by looking at a single example:

>>> processed_example = prepare_dataset(dataset[0])
>>> list(processed_example.keys())
['input_ids', 'labels', 'stop_labels', 'speaker_embeddings']

Speaker embeddings should be a 512-element vector:

>>> processed_example["speaker_embeddings"].shape
(512,)

The labels should be a log-mel spectrogram with 80 mel bins.

>>> import matplotlib.pyplot as plt

>>> plt.figure()
>>> plt.imshow(processed_example["labels"].T)
>>> plt.show()

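Log-mel spectrogram with 80 mel bins

Side note: If you find this spectrogram confusing, it may be due to your familiarity with the convention of placing low frequencies at the bottom and high frequencies at the top of a plot. However, when plotting spectrograms as an image using the matplotlib library, the y-axis is flipped and the spectrograms appear upside down.

Now apply the processing function to the entire dataset. This will take between 5 and 10 minutes.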

>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

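You'll see a warning saying that some examples in the dataset are longer than the maximum input length the model can handle (600 tokens). Remove those examples from the dataset. Here we go even further and, to allow for larger batch sizes, remove anything over 200 tokens.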

>>> def is_not_too_long(input_ids):
...     input_length = len(input_ids)
...     return input_length < 200

>>> dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"])
>>> len(dataset)
8259

Next, create a basic train/test split:

>>> dataset = dataset.train_test_split(test_size=0.1)

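Data collator

In order to combine multiple examples into a batch, you need to define a custom data collator. This collator will pad shorter sequences with padding tokens, ensuring that all examples have the same length. For the spectrogram labels, the padded portions are replaced with the special value -100. This special value instructs the model to ignore that part of the spectrogram when calculating the spectrogram loss.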

>>> from dataclasses import dataclass
>>> from typing import Any, Dict, List, Union

>>> @dataclass
... class TTSDataCollatorWithPadding:
...     processor: Any

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         input_ids = [{"input_ids": feature["input_ids"]} for feature in features]
...         label_features = [{"input_values": feature["labels"]} for feature in features]
...         speaker_features = [feature["speaker_embeddings"] for feature in features]

...         # collate the inputs and targets into a batch
...         batch = self.processor.pad(input_ids=input_ids, labels=label_features, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         batch["labels"] = batch["labels"].masked_fill(batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100)

...         # not used during fine-tuning
...         del batch["decoder_attention_mask"]

...         # round down target lengths to multiple of reduction factor
...         if model.config.reduction_factor > 1:
...             target_lengths = torch.tensor([len(feature["input_values"]) for feature in label_features])
...             target_lengths = target_lengths.new(
...                 [length - length % model.config.reduction_factor for length in target_lengths]
...             )
...             max_length = max(target_lengths)
...             batch["labels"] = batch["labels"][:, :max_length]

...         # also add in the speaker embeddings
...         batch["speaker_embeddings"] = torch.tensor(speaker_features)

...         return batch

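In SpeechT5, the input to the decoder part of the model is reduced by a factor of 2. In other words, it throws away every other timestep from the target sequence. The decoder then predicts a sequence that is twice as long. Since the original target sequence length may be odd, the data collator makes sure to round the maximum length of the batch down to be a multiple of 2.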

>>> data_collator = TTSDataCollatorWithPadding(processor=processor)
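The rounding step performed inside the collator can be illustrated on its own with a few lines of plain Python. The helper name `round_down_to_multiple` and the spectrogram lengths below are made up for illustration.

```python
def round_down_to_multiple(length, reduction_factor=2):
    """Largest multiple of reduction_factor that does not exceed length."""
    return length - length % reduction_factor


# Made-up spectrogram lengths (in frames) for a batch of three examples
target_lengths = [317, 402, 255]
rounded = [round_down_to_multiple(n) for n in target_lengths]
max_length = max(rounded)

print(rounded)     # [316, 402, 254]
print(max_length)  # 402
```

The collator then truncates the padded label tensor to `max_length` frames, so the batch length is always divisible by the reduction factor.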

Train the model

Load the pre-trained model from the same checkpoint as you used for loading the processor:

>>> from transformers import SpeechT5ForTextToSpeech

>>> model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)

The use_cache=True option is incompatible with gradient checkpointing. Disable it for training.

>>> model.config.use_cache = False

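Define the training arguments. Here we are not computing any evaluation metrics during the training process. Instead, we'll only look at the loss: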

>>> from transformers import Seq2SeqTrainingArguments

>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="speecht5_finetuned_voxpopuli_nl",  # change to a repo name of your choice
...     per_device_train_batch_size=4,
...     gradient_accumulation_steps=8,
...     learning_rate=1e-5,
...     warmup_steps=500,
...     max_steps=4000,
...     gradient_checkpointing=True,
...     fp16=True,
...     evaluation_strategy="steps",
...     per_device_eval_batch_size=2,
...     save_steps=1000,
...     eval_steps=1000,
...     logging_steps=25,
...     report_to=["tensorboard"],
...     load_best_model_at_end=True,
...     greater_is_better=False,
...     label_names=["labels"],
...     push_to_hub=True,
... )

Instantiate the Trainer object and pass the model, dataset, and data collator to it.

>>> from transformers import Seq2SeqTrainer

>>> trainer = Seq2SeqTrainer(
...     args=training_args,
...     model=model,
...     train_dataset=dataset["train"],
...     eval_dataset=dataset["test"],
...     data_collator=data_collator,
...     tokenizer=processor,
... )

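And with that, you're ready to start training! Training will take several hours. Depending on your GPU, it is possible that you will encounter a CUDA "out-of-memory" error when you start training. In this case, you can reduce the per_device_train_batch_size incrementally by factors of 2 and increase gradient_accumulation_steps by 2x to compensate.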

>>> trainer.train()
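The reason halving the batch size and doubling the accumulation steps compensate for each other is that the number of examples per optimizer step stays the same. A small arithmetic sketch follows; `effective_batch_size` is a hypothetical helper, and a single device is assumed by default.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Settings from the guide: 4 examples per device, 8 accumulation steps
print(effective_batch_size(4, 8))   # 32
# Halve the batch size, double the accumulation steps: same effective size
print(effective_batch_size(2, 16))  # 32
print(effective_batch_size(1, 32))  # 32
```

Memory use drops with the per-device batch size, while each optimizer step takes proportionally more forward and backward passes.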

To be able to use your checkpoint with a pipeline, make sure to save the processor with the checkpoint:

>>> processor.save_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Push the final model to the 🤗 Hub:

>>> trainer.push_to_hub()

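Inference

Inference with a pipeline

Great, now that you've fine-tuned a model, you can use it for inference! First, let's see how you can use it with a corresponding pipeline. Let's create a "text-to-speech" pipeline with your checkpoint: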

>>> from transformers import pipeline

>>> pipe = pipeline("text-to-speech", model="YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Pick a piece of text in Dutch you'd like narrated, e.g.:

>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"

To use SpeechT5 with the pipeline, you'll need a speaker embedding. Let's get it from an example in the test dataset:

>>> example = dataset["test"][304]
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

Now you can pass the text and speaker embeddings to the pipeline, and it will take care of the rest:

>>> forward_params = {"speaker_embeddings": speaker_embeddings}
>>> output = pipe(text, forward_params=forward_params)
>>> output
{'audio': array([-6.82714235e-05, -4.26525949e-04,  1.06134125e-04, ...,
        -1.22392643e-03, -7.76011671e-04,  3.29112721e-04], dtype=float32),
 'sampling_rate': 16000}

You can then listen to the result:

>>> from IPython.display import Audio
>>> Audio(output['audio'], rate=output['sampling_rate'])

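Run inference manually

You can achieve the same inference results without using the pipeline; however, more steps are required.

Load the model from the 🤗 Hub: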

>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl")

Pick an example from the test dataset to obtain a speaker embedding.

>>> example = dataset["test"][304]
>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)

Define the input text and tokenize it.

>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!"
>>> inputs = processor(text=text, return_tensors="pt")

Create a spectrogram with your model:

>>> spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

Visualize the spectrogram, if you'd like to:

>>> plt.figure()
>>> plt.imshow(spectrogram.T)
>>> plt.show()

Generated log-mel spectrogram

Finally, use the vocoder to turn the spectrogram into sound.

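>>> # `vocoder` is not defined earlier in this guide; this assumes the HiFi-GAN vocoder released for SpeechT5
>>> from transformers import SpeechT5HifiGan

>>> vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

>>> with torch.no_grad():
...     speech = vocoder(spectrogram)

>>> from IPython.display import Audio

>>> Audio(speech.numpy(), rate=16000)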

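In our experience, obtaining satisfactory results from this model can be challenging. The quality of the speaker embeddings appears to be a significant factor. Since SpeechT5 was pre-trained with English x-vectors, it performs best when using English speaker embeddings. If the synthesized speech sounds poor, try using a different speaker embedding.

Increasing the training duration is also likely to enhance the quality of the results. Even so, the speech is clearly Dutch instead of English, and it does capture the voice characteristics of the speaker (compare to the original audio in the example). Another thing to experiment with is the model's configuration. For example, try using config.reduction_factor = 1 to see if this improves the results.

Finally, ethical considerations matter. Although TTS technology has numerous useful applications, it may also be used for malicious purposes, such as impersonating someone's voice without their knowledge or consent. Please use TTS judiciously and responsibly.

Generation

Text generation strategies

Original: huggingface.co/docs/transformers/v4.37.2/en/generation_strategies

Text generation is essential to many NLP tasks, such as open-ended text generation, summarization, translation, and more. It also plays a role in a variety of mixed-modality applications that have text as an output, like speech-to-text and vision-to-text. Some of the models that can generate text include GPT2, XLNet, OpenAI GPT, CTRL, TransformerXL, XLM, Bart, T5, GIT, Whisper.

Check out a few examples that use the generate() method to produce text outputs for different tasks:

Text summarization
Image captioning
Audio transcription

Note that the inputs to the generate method depend on the model's modality. They are returned by the model's preprocessor class, such as AutoTokenizer or AutoProcessor. If a model's preprocessor creates more than one kind of input, pass all the inputs to generate(). You can learn more about an individual model's preprocessor in the corresponding model's documentation.

The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the generate() method will use. Modifying a decoding strategy does not change the values of any trainable parameters. However, it can have a noticeable impact on the quality of the generated output. It can help reduce repetition in the text and make it more coherent.

This guide describes:

the default generation configuration
common decoding strategies and their main parameters
saving and sharing custom generation configurations with your fine-tuned model on the 🤗 Hub

Default text generation configuration

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a pipeline(), the models call the PreTrainedModel.generate() method that applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.

When you load a model explicitly, you can inspect the generation configuration that comes with it through model.generation_config: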

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
>>> model.generation_config
GenerationConfig {
    "bos_token_id": 50256,
    "eos_token_id": 50256,
}

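Printing out model.generation_config reveals only the values that differ from the default generation configuration; it does not list any of the default values.

The default generation configuration limits the combined size of the output and the input prompt to a maximum of 20 tokens to avoid running into resource limitations. The default decoding strategy is greedy search, the simplest decoding strategy, which picks the token with the highest probability as the next token. For many tasks and small output sizes this works well. However, when used to generate longer outputs, greedy search can start producing highly repetitive results.

Customize text generation

You can override any generation_config by passing the parameters and their values directly to the generate method: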

>>> my_model.generate(**inputs, num_beams=4, do_sample=True)

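Even if the default decoding strategy mostly works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include:

max_new_tokens: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt. As an alternative to using the output's length as a stopping criterion, you can choose to stop generation once the full generation exceeds some amount of time. To learn more, check out StoppingCriteria.
num_beams: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.
do_sample: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling, and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.
num_return_sequences: the number of sequence candidates to return for each input. This option is only available for decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.

Save a custom decoding strategy with your model

If you would like to share your fine-tuned model with a specific generation configuration, you can:

Create a GenerationConfig class instance
Specify the decoding strategy parameters
Save your generation configuration with GenerationConfig.save_pretrained(), making sure to leave its config_file_name argument empty
Set push_to_hub to True to upload your config to the model's repo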

>>> from transformers import AutoModelForCausalLM, GenerationConfig

>>> model = AutoModelForCausalLM.from_pretrained("my_account/my_model")
>>> generation_config = GenerationConfig(
...     max_new_tokens=50, do_sample=True, top_k=50, eos_token_id=model.config.eos_token_id
... )
>>> generation_config.save_pretrained("my_account/my_model", push_to_hub=True)

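You can also store several generation configurations in a single directory, making use of the config_file_name argument in GenerationConfig.save_pretrained(). You can later instantiate them with GenerationConfig.from_pretrained(). This is useful if you want to store several generation configurations for a single model (e.g. one for creative text generation with sampling, and one for summarization with beam search). You must have the right Hub permissions to add configuration files to a model.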

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

>>> tokenizer = AutoTokenizer.from_pretrained("t5-small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

>>> translation_generation_config = GenerationConfig(
...     num_beams=4,
...     early_stopping=True,
...     decoder_start_token_id=0,
...     eos_token_id=model.config.eos_token_id,
...     pad_token_id=model.config.pad_token_id,
... )

>>> # Tip: add `push_to_hub=True` to push to the Hub
>>> translation_generation_config.save_pretrained("/tmp", "translation_generation_config.json")

>>> # You could then use the named generation config file to parameterize generation
>>> generation_config = GenerationConfig.from_pretrained("/tmp", "translation_generation_config.json")
>>> inputs = tokenizer("translate English to French: Configuration files are easy to use!", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)
>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
['Les fichiers de configuration sont faciles à utiliser!']

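Streaming

generate() supports streaming through its streamer input. The streamer input is compatible with any instance of a class that has the following methods: put() and end(). Internally, put() is used to push new tokens and end() is used to flag the end of text generation.

The API for the streamer classes is still under development and may change in the future.

In practice, you can craft your own streaming class for all sorts of purposes! We also have basic streaming classes ready for you to use. For example, you can use the TextStreamer class to stream the output of generate() to your screen, one word at a time: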

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

>>> tok = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")
>>> inputs = tok(["An increasing sequence: one,"], return_tensors="pt")
>>> streamer = TextStreamer(tok)

>>> # Despite returning the usual output, the streamer will also print the generated text to stdout.
>>> _ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven,
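To illustrate the put()/end() interface described above, here is a minimal sketch of a custom streamer that simply collects what it receives. `ListStreamer` is a hypothetical class name, and the loop below only imitates the calls generate() would make; what exactly generate() passes to put() (tensor shapes, whether the prompt comes first) depends on the library version and is treated as an assumption here.

```python
class ListStreamer:
    """Minimal streamer: collects whatever is pushed to it into a list."""

    def __init__(self):
        self.chunks = []
        self.finished = False

    def put(self, value):
        # Called with each new piece of output (token ids in practice)
        self.chunks.append(value)

    def end(self):
        # Called once when generation is over
        self.finished = True


# Stand-in for the calls generate() would make; the token ids are made up
streamer = ListStreamer()
for token_ids in ([318], [257], [1332]):
    streamer.put(token_ids)
streamer.end()

print(streamer.chunks, streamer.finished)  # [[318], [257], [1332]] True
```

Because the interface is duck-typed, an instance of such a class could in principle be passed as the streamer argument of generate(), for example to forward tokens to a queue or a web socket instead of stdout.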

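Decoding strategies

Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.

Here, we'll show some of the parameters that control the decoding strategies and illustrate how you can use them.

Greedy Search

generate uses greedy search decoding by default, so you don't have to pass any parameters to enable it. This means the parameter num_beams is set to 1 and do_sample=False.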

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "I look forward to"
>>> checkpoint = "distilgpt2"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']
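To see what greedy search does at each step without loading a model, here is a toy sketch in plain Python. The table `NEXT_TOKEN_PROBS` stands in for a language model and contains made-up probabilities; `greedy_decode` is a hypothetical helper and not the library's implementation.

```python
# Hypothetical next-token probabilities, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def greedy_decode(start="<s>", max_new_tokens=10):
    """Repeatedly append the single most probable next token."""
    tokens, current = [], start
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[current]
        current = max(probs, key=probs.get)
        if current == "<eos>":
            break
        tokens.append(current)
    return tokens


print(greedy_decode())  # ['the', 'cat']
```

In this toy table, the greedy result "the cat" has probability 0.5 × 0.4 = 0.2, while "a dog" has 0.4 × 0.9 = 0.36: greedy search commits to the locally best first token and misses the more probable sequence.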

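Contrastive search

The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, check out this blog post. The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k: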

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Hugging Face Company is"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
in the business and our customer service is second to none.\n\nIf you have any questions about our
products or services, feel free to contact us at any time. We look forward to hearing from you!']

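Multinomial sampling

As opposed to greedy search, which always chooses the token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, which reduces the risk of repetition.

To enable multinomial sampling, set do_sample=True and num_beams=1.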

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
>>> set_seed(0)  # For reproducibility

>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> prompt = "Today was an amazing day because"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited,
that\'s a terrible feeling."']
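The sampling step itself can be sketched with the standard library alone. The distribution below is made up, and `sample_next_token` with its optional `top_k` argument is a hypothetical helper that only mimics the idea of multinomial and Top-K sampling; it is not how the library implements them.

```python
import random


def sample_next_token(probs, rng, top_k=None):
    """Draw one token at random, in proportion to its probability."""
    items = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]  # keep only the k most probable tokens
    tokens = [token for token, _ in items]
    weights = [weight for _, weight in items]
    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution over the next token
probs = {"sunny": 0.5, "rainy": 0.3, "windy": 0.15, "purple": 0.05}

rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_next_token(probs, rng) for _ in range(1000)]
print({token: samples.count(token) for token in probs})

# With top_k=2 the two least likely tokens can never be drawn
top2 = {sample_next_token(probs, rng, top_k=2) for _ in range(1000)}
print(top2)
```

Over many draws the counts roughly follow the probabilities, so even unlikely tokens appear occasionally; restricting to the top k tokens removes that long tail.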

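Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search.

To enable this decoding strategy, specify num_beams (the number of hypotheses to keep track of) greater than 1.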

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "It is astonishing how one can"
>>> checkpoint = "gpt2-medium"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']
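The benefit of keeping several hypotheses can be shown on a toy model. The sketch below is a simplified beam search in plain Python, without length penalty or early-stopping heuristics; the probability table and the helper `beam_search` are made up for illustration and do not reflect the library's implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    "the": {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    "a": {"cat": 0.1, "dog": 0.9},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}


def beam_search(num_beams=2, max_new_tokens=10, start="<s>"):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            # Sequences ending in <eos> are set aside; the rest stay active
            (finished if tokens[-1] == "<eos>" else beams).append((tokens, score))
        if not beams:
            break
    best_tokens, best_score = max(finished or beams, key=lambda item: item[1])
    words = [t for t in best_tokens if t not in ("<s>", "<eos>")]
    return words, math.exp(best_score)


print(beam_search(num_beams=1))  # behaves like greedy search: "the cat"
print(beam_search(num_beams=2))  # finds the more probable "a dog"
```

With one beam the search returns "the cat" (probability 0.2); with two beams it keeps "a" alive after the first step and ends with "a dog" (probability 0.36), the sequence that starts with a lower-probability token.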

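Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling. You need to specify num_beams greater than 1 and set do_sample=True to use this decoding strategy.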

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed
>>> set_seed(0)  # For reproducibility

>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "t5-small"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'

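Diverse beam search decoding

The diverse beam search decoding strategy is an extension of the beam search strategy that generates a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures that the outputs are distinct across groups, and beam search is used within each group.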

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate  principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and
culture, and they allow us to design the'

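This guide illustrates the main parameters that enable the various decoding strategies. The generate method has more advanced parameters that give you further control over its behavior. For the complete list of available parameters, refer to the API documentation.

Speculative decoding

Speculative decoding (also known as assisted decoding) is a modification of the decoding strategies above that uses an assistant model (ideally a much smaller one) with the same tokenizer to generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. If do_sample=True, token validation uses the resampling introduced in the speculative decoding paper.

Currently, assisted decoding supports only greedy search and sampling, and it does not support batched inputs. To learn more about assisted decoding, check this blog post.

To enable assisted decoding, pass a model as the assistant_model argument.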

>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
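
The draft-then-verify idea can be sketched for the greedy case with toy stand-ins. Everything below is an assumption made for illustration: main_next and draft_next are hypothetical functions that play the roles of the main and assistant models, and the sketch verifies candidates one position at a time, whereas a real model checks them all in a single forward pass.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    # Plain greedy decoding, used as the reference result.
    seq = list(prompt)
    for _ in range(max_new_tokens):
        seq.append(next_token(seq))
    return seq

def assisted_greedy_decode(main_next, draft_next, prompt, max_new_tokens, num_candidates=3):
    seq = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(seq) < target_len:
        # 1. The assistant drafts a few candidate tokens autoregressively.
        draft = []
        for _ in range(num_candidates):
            draft.append(draft_next(seq + draft))
        # 2. The main model checks each drafted position in order.
        accepted = []
        for i, tok in enumerate(draft):
            expected = main_next(seq + draft[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the main model's token, drop the rest.
                accepted.append(expected)
                break
        seq.extend(accepted)
    return seq[:target_len]

# Hypothetical toy "models" over the digits 0-9.
def main_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Cheaper model that agrees with the main model except right after a 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

plain = greedy_decode(main_next, [0], 8)
assisted = assisted_greedy_decode(main_next, draft_next, [0], 8)
print(plain == assisted)
```

In this sketch the assisted output matches plain greedy decoding with the main model, because every token the assistant proposes is either confirmed or replaced by the main model's own choice.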

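When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness, just as in multinomial sampling. However, in assisted decoding, reducing the temperature may help improve latency.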

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> set_seed(42)  # For reproducibility

>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are going to the same party. It is a small party, in a small']
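
As a rough illustration of what temperature does, the sketch below applies a temperature-scaled softmax to made-up logits. The logits and the helper function are hypothetical. One plausible reading of the latency remark above, offered here as an assumption, is that a sharper distribution makes the assistant's candidates agree with the main model more often.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before normalizing.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.5)

for name, p in (("temperature=1.0", p_default), ("temperature=0.5", p_sharp)):
    print(name, [round(x, 3) for x in p])
```

Lowering the temperature moves probability mass toward the top-scoring token, so sampling behaves more like greedy search.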

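Prompting

Image tasks with IDEFICS

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/idefics

While individual tasks can be tackled by fine-tuning specialized models, an alternative approach that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. For instance, large language models can handle NLP tasks such as summarization, translation, and classification. This approach is no longer limited to a single modality such as text, and in this guide we illustrate how to solve image-text tasks with a large multimodal model called IDEFICS.

IDEFICS is an open-access vision and language model based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, create stories grounded in multiple images, and so on. IDEFICS comes in two variants, 80 billion parameters and 9 billion parameters, both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed versions of the model adapted for conversational use cases.

This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether this approach suits your use case better than fine-tuning specialized models for each individual task.

In this guide, you'll learn how to:

Load IDEFICS and load the quantized version of the model
Use IDEFICS for:
- Image captioning
- Prompted image captioning
- Few-shot prompting
- Visual question answering
- Image classification
- Image-guided text generation
Run inference in batch mode
Run IDEFICS instruct for conversational use

Before you begin, make sure you have all the necessary libraries installed.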

pip install -q bitsandbytes sentencepiece accelerate transformers

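To run the following examples with a non-quantized version of the model checkpoint, you will need at least 20GB of GPU memory.

Loading the model

Let's start by loading the model's 9 billion parameters checkpoint: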

>>> checkpoint = "HuggingFaceM4/idefics-9b"

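Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. The IDEFICS processor wraps a LlamaTokenizer and the IDEFICS image processor into a single processor that takes care of preparing text and image inputs for the model.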

>>> import torch

>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

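Setting device_map to "auto" automatically determines how to load and store the model weights in the most optimized manner given the existing devices.

Quantized model

If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the processor in 4-bit precision, pass a BitsAndBytesConfig to the from_pretrained method, and the model will be compressed on the fly while loading.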

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig

>>> quantization_config = BitsAndBytesConfig(
...     load_in_4bit=True,
...     bnb_4bit_compute_dtype=torch.float16,
... )

>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> model = IdeficsForVisionText2Text.from_pretrained(
...     checkpoint,
...     quantization_config=quantization_config,
...     device_map="auto"
... )
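
A back-of-the-envelope estimate shows why quantization helps. The sketch below counts only the memory needed to store the weights, assuming roughly 9 billion parameters; it ignores activations, quantization constants, and framework overhead, so real usage will be higher. The helper function is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    # Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes).
    return num_params * bits_per_param / 8 / 1e9

num_params = 9e9  # approximate size of the 9-billion-parameter checkpoint

bf16_gb = weight_memory_gb(num_params, 16)     # torch.bfloat16: 16 bits per weight
four_bit_gb = weight_memory_gb(num_params, 4)  # 4-bit quantization

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{four_bit_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights alone take about 18 GB, which is consistent with the 20GB requirement mentioned for the non-quantized checkpoint, while the 4-bit weights need roughly a quarter of that.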

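Now that you have the model loaded in one of the suggested ways, let's move on to exploring the tasks you can use IDEFICS for.

Image captioning

Image captioning is the task of predicting a caption for a given image. A common application is to help visually impaired people navigate different situations, for instance, exploring image content online.

To illustrate the task, get an image to be captioned, e.g.:

Image of a puppy in a garden

Photo by Hendo Wang.

IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the model, only the preprocessed input image. Without a text prompt, the model starts generating text from the BOS (beginning-of-sequence) token, thus creating a caption.

As image input to the model, you can use either an image object (PIL.Image) or a URL from which the image can be retrieved.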

>>> prompt = [
...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
A puppy in a flower bed

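It is a good idea to include bad_words_ids in the call to generate to avoid errors that arise when increasing max_new_tokens: the model will want to generate a new <image> or <fake_token_around_image> token when there is no image being generated by the model. You can set it on the fly as in this guide, or store it in the GenerationConfig as described in the Text generation strategies guide.

Prompted image captioning

You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take another image to illustrate:

Image of the Eiffel Tower at night

Photo by Denys Nevozhai.

Text and image prompts can be passed to the model's processor as a single list to create appropriate inputs.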

>>> prompt = [
...     "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...     "This is an image of ",
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
This is an image of the Eiffel Tower in Paris, France.

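Few-shot prompting

While IDEFICS demonstrates great zero-shot results, your task may require a certain caption format or come with other restrictions or requirements that increase its complexity. Few-shot prompting can be used to enable in-context learning. By providing examples in the prompt, you can steer the model to generate results that mimic the format of the given examples.

Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that shows the model that, in addition to learning what the object in an image is, we would also like to get some interesting information about it. Then, let's see if we can get the same response format for an image of the Statue of Liberty:

Image of the Statue of Liberty

Photo by Juan Mayobre.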

>>> prompt = ["User:",
...            "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...            "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n",
...            "User:",
...            "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80",
...            "Describe this image.\nAssistant:"
...            ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
User: Describe this image.
Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. 
User: Describe this image.
Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall.

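Notice that from just a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.).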
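
When experimenting with more shots, it can help to assemble the interleaved prompt list programmatically. The helper below is a hypothetical sketch that mirrors the User/Assistant layout used above; the function name and the example.com URLs are placeholders, not part of the library or of the original guide.

```python
def build_few_shot_prompt(examples, query_image, instruction="Describe this image."):
    # examples: list of (image_url, assistant_answer) pairs shown to the model.
    prompt = []
    for image_url, answer in examples:
        prompt += ["User:", image_url, f"{instruction}\nAssistant: {answer}\n"]
    # The final turn leaves the assistant's answer open for the model to fill in.
    prompt += ["User:", query_image, f"{instruction}\nAssistant:"]
    return prompt

examples = [
    ("https://example.com/eiffel.jpg",  # placeholder URL
     "An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building."),
]
prompt = build_few_shot_prompt(examples, "https://example.com/liberty.jpg")

for part in prompt:
    print(repr(part))
```

The resulting list has the same shape as the hand-written prompt and could be passed to the processor in the same way; adding more pairs to examples turns it into a 3-shot or 5-shot prompt.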

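Visual question answering

Visual question answering (VQA) is the task of answering open-ended questions based on an image. Similar to image captioning, it can be used in accessibility applications, but also in education (reasoning about visual materials), customer service (questions about products based on images), and image retrieval.

Let's get a new image for this task:

Image of a couple having a picnic

Photo by Jarritos Mexican Soda.

You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: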

>>> prompt = [
...     "Instruction: Provide an answer to the question. Use the image to answer.\n",
...     "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...     "Question: Where are these people and what's the weather like? Answer:"
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Provide an answer to the question. Use the image to answer.
 Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day.

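Image classification

IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing labeled examples from those specific categories. Given a list of categories and using its image and text understanding capabilities, the model can infer which category the image likely belongs to.

Say we have this image of a vegetable stand:

Image of a vegetable stand

Photo by Peter Wendt.

We can instruct the model to classify the image into one of the categories that we have: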

>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office']
>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n",
...     "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",    
...     "Category: "
... ]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0])
Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office'].
Category: Vegetables

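In the example above we instruct the model to classify the image into a single category; however, you can also prompt the model to do rank classification.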
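
Because the decoded text echoes the whole prompt, you usually need a small post-processing step to recover the predicted label. The function below is a hypothetical sketch of one way to do that; it assumes the answer follows the last "Category:" cue and matches one of the candidate labels case-insensitively.

```python
def extract_category(generated_text, categories, marker="Category:"):
    # Take whatever the model wrote after the last "Category:" cue.
    answer = generated_text.rsplit(marker, 1)[-1].strip().lower()
    for category in categories:
        if answer.startswith(category.lower()):
            return category
    return None  # the model answered with something outside the list

categories = ['animals', 'vegetables', 'city landscape', 'cars', 'office']
generated_text = (
    "Instruction: Classify the following image into a single category "
    f"from the following list: {categories}.\n"
    "Category: Vegetables"
)
label = extract_category(generated_text, categories)
print(label)
```

Returning None for unmatched answers makes it easy to detect cases where the model ignored the list, which can happen with free-form generation.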

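Image-guided text generation

For more creative applications, you can use image-guided text generation to generate text based on an image. This can be useful for creating product descriptions, ads, descriptions of a scene, etc.

Let's prompt IDEFICS to write a story based on a simple image of a red door:

Image of a red door with a pumpkin

Photo by Craig Tidball.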

>>> prompt = ["Instruction: Use the image to write a story. \n",
...     "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80",
...     "Story: \n"]

>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> print(generated_text[0]) 
Instruction: Use the image to write a story. 
 Story: 
Once upon a time, there was a little girl who lived in a house with a red door.  She loved her red door.  It was the prettiest door in the whole world.

One day, the little girl was playing in her yard when she noticed a man standing on her doorstep.  He was wearing a long black coat and a top hat.

The little girl ran inside and told her mother about the man.

Her mother said, “Don’t worry, honey.  He’s just a friendly ghost.”

The little girl wasn’t sure if she believed her mother, but she went outside anyway.

When she got to the door, the man was gone.

The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep.

He was wearing a long black coat and a top hat.

The little girl ran

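Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost.

For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help you significantly improve the quality of the generated output. Check out Text generation strategies to learn more.

Running inference in batch mode

All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference for a batch of examples by passing a list of prompts: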

>>> prompts = [
...     [   "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
...     [   "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80",
...         "This is an image of ",
...     ],
... ]

>>> inputs = processor(prompts, return_tensors="pt").to("cuda")
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i,t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n") 
0:
This is an image of the Eiffel Tower in Paris, France.

1:
This is an image of a couple on a picnic blanket.

2:
This is an image of a vegetable stand.

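IDEFICS instruct for conversational use

For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub: HuggingFaceM4/idefics-80b-instruct and HuggingFaceM4/idefics-9b-instruct.

These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.

The use and prompting for conversational use is very similar to using the base models: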

>>> import torch
>>> from transformers import IdeficsForVisionText2Text, AutoProcessor

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

>>> prompts = [
...     [
...         "User: What is in this image?",
...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
...         "<end_of_utterance>",
...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
...         "\nUser:",
...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
...         "And who is that?<end_of_utterance>",
...         "\nAssistant:",
...     ],
... ]

>>> # --batched mode
>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
>>> # --single sample mode
>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)

>>> # Generation args
>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
>>> for i, t in enumerate(generated_text):
...     print(f"{i}:\n{t}\n")

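LLM prompting guide

Original: huggingface.co/docs/transformers/v4.37.2/en/tasks/prompting

Large language models such as Falcon and LLaMA are pretrained transformer models initially trained to predict the next token given some input text. They typically have billions of parameters and have been trained on trillions of tokens for an extended period of time. As a result, these models become quite powerful and versatile, and you can use them to solve multiple NLP tasks by instructing the models with natural language prompts.

Designing such prompts to ensure the optimal output is often called "prompt engineering". Prompt engineering is an iterative process that requires a fair amount of experimentation. Natural languages are much more flexible and expressive than programming languages, but they can also introduce some ambiguity. At the same time, prompts in natural language are quite sensitive to changes. Even minor modifications in prompts can lead to wildly different outputs.

While there is no exact recipe for creating prompts that suit all cases, researchers have worked out a number of best practices that help to achieve optimal results more consistently.

This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks. You'll learn:

Basics of prompting
Best practices of LLM prompting
Advanced prompting techniques: few-shot prompting and chain-of-thought
When to fine-tune instead of prompting

Prompt engineering is only a part of the LLM output optimization process. Another essential component is choosing the optimal text generation strategy. You can customize how your LLM selects each of the subsequent tokens when generating text without modifying any of the trainable parameters. By tweaking the text generation parameters, you can reduce repetition in the generated text and make it more coherent and human-sounding. Text generation strategies and parameters are out of scope for this guide, but you can learn more about these topics in the following guides:

Generation with LLMs
Text generation strategies

Basics of prompting

Types of models

The majority of modern LLMs are decoder-only transformers. Some examples include LLaMA, Llama2, Falcon, and GPT2. However, you may encounter encoder-decoder transformer LLMs as well, for instance, Flan-T5 and BART.

Encoder-decoder-style models are typically used in generative tasks where the output heavily relies on the input, for example, in translation and summarization. Decoder-only models are used for all other types of generative tasks.

When using a pipeline to generate text with an LLM, it is important to know what type of LLM you are using, because they use different pipelines.

Run inference with decoder-only models with the text-generation pipeline: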

>>> from transformers import pipeline
>>> import torch

>>> torch.manual_seed(0)
>>> generator = pipeline('text-generation', model = 'gpt2')
>>> prompt = "Hello, I'm a language model"

>>> generator(prompt, max_length = 30)
[{'generated_text': "Hello, I'm a language model expert, so I'm a big believer in the concept that I know very well and then I try to look into"}]

To run inference with an encoder-decoder model, use the text2text-generation pipeline:

>>> text2text_generator = pipeline("text2text-generation", model = 'google/flan-t5-base')
>>> prompt = "Translate from English to French: I'm very happy to see you"

>>> text2text_generator(prompt)
[{'generated_text': 'Je suis très heureuse de vous rencontrer.'}]

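Base vs instruct/chat models

Most of the recent LLM checkpoints available on the 🤗 Hub come in two versions: base and instruct (or chat). For example, tiiuae/falcon-7b and tiiuae/falcon-7b-instruct.

Base models are excellent at completing text given an initial prompt; however, they are not ideal for NLP tasks where they need to follow instructions, or for conversational use. This is where the instruct (chat) versions come in. These checkpoints are the result of further fine-tuning the pretrained base versions on instructions and conversational data. This additional fine-tuning makes them a better choice for many NLP tasks.

Let's illustrate some simple prompts that you can use with tiiuae/falcon-7b-instruct to solve some common NLP tasks.

NLP tasks

First, let's set up the environment:

pip install -q transformers accelerate

Next, let's load the model with the appropriate pipeline ("text-generation"):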

>>> from transformers import pipeline, AutoTokenizer
>>> import torch

>>> torch.manual_seed(0)
>>> model = "tiiuae/falcon-7b-instruct"

>>> tokenizer = AutoTokenizer.from_pretrained(model)
>>> pipe = pipeline(
...     "text-generation",
...     model=model,
...     tokenizer=tokenizer,
...     torch_dtype=torch.bfloat16,
...     device_map="auto",
... )

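Note that Falcon models were trained using the bfloat16 datatype, so we recommend you use the same. This requires a recent version of CUDA and works best on modern cards.

Now that we have the model loaded via the pipeline, let's explore how you can use prompts to solve NLP tasks.

Text classification

One of the most common forms of text classification is sentiment analysis, which assigns a label like "positive", "negative", or "neutral" to a sequence of text. Let's write a prompt that instructs the model to classify a given text (a movie review). We'll start by giving the instruction, and then specify the text to classify. Note that instead of leaving it at that, we also add the beginning of the response, "Sentiment:":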

>>> torch.manual_seed(0)
>>> prompt = """Classify the text into neutral, negative or positive. 
... Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
... Sentiment:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Classify the text into neutral, negative or positive. 
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
Positive

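As a result, the output contains a classification label from the list we provided in the instructions, and it is a correct one!

You may notice that in addition to the prompt, we pass a max_new_tokens parameter. It controls the number of tokens the model should generate, and it is one of the many text generation parameters that you can learn about in the Text generation strategies guide.

Named entity recognition

Named entity recognition (NER) is the task of finding named entities in a piece of text, such as a person, location, or organization. Let's modify the instructions in the prompt to make the LLM perform this task. Here, let's also set return_full_text = False so that the output doesn't contain the prompt: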

>>> torch.manual_seed(1)
>>> prompt = """Return a list of named entities in the text.
... Text: The Golden State Warriors are an American professional basketball team based in San Francisco.
... Named entities:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=15,
...     return_full_text = False,    
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
- Golden State Warriors
- San Francisco

As you can see, the model correctly identified two named entities from the given text.
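
Conceptually, return_full_text = False gives you the generated text without the echoed prompt. The sketch below imitates that effect on plain strings; it is an illustration under that assumption, not the pipeline's actual implementation, and strip_prompt is a hypothetical helper.

```python
def strip_prompt(prompt, full_text):
    # Drop the echoed prompt if the text starts with it; otherwise leave it alone.
    if full_text.startswith(prompt):
        return full_text[len(prompt):]
    return full_text

prompt = (
    "Return a list of named entities in the text.\n"
    "Text: The Golden State Warriors are an American professional basketball team based in San Francisco.\n"
    "Named entities:\n"
)
full_text = prompt + "- Golden State Warriors\n- San Francisco"
print(strip_prompt(prompt, full_text))
```

A helper like this can be handy when you keep the full text for logging but only want to show the completion to users.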

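Translation

Another task LLMs can perform is translation. You can choose to use encoder-decoder models for this task; however, here, for the simplicity of the examples, we'll keep using Falcon-7b-instruct, which does a decent job. Once again, here's how you can write a basic prompt to instruct a model to translate a piece of text from English to Italian: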

>>> torch.manual_seed(2)
>>> prompt = """Translate the English text to Italian.
... Text: Sometimes, I've believed as many as six impossible things before breakfast.
... Translation:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=20,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
A volte, ho creduto a sei impossibili cose prima di colazione.

Here we've added do_sample=True and top_k=10 to allow the model to be a bit more flexible when generating output.
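
To give an idea of what top_k does during sampling, here is a minimal sketch on a made-up distribution: only the k most probable tokens are kept, their probabilities are renormalized, and the next token is drawn from that reduced set. The token names, probabilities, and helper functions are hypothetical.

```python
import random

def top_k_filter(probs, k):
    # Keep the k most probable tokens and renormalize their probabilities.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical next-token probabilities.
probs = {"colazione": 0.55, "pranzo": 0.25, "cena": 0.15, "ieri": 0.04, "zebra": 0.01}

filtered = top_k_filter(probs, k=3)
rng = random.Random(0)
draws = {sample(filtered, rng) for _ in range(200)}

print(filtered)
print(draws)
```

Tokens outside the top k can never be drawn, which removes the unlikely tail while still leaving room for some variety.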

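Text summarization

Similar to translation, text summarization is another generative task where the output heavily relies on the input, and encoder-decoder models can be a better choice. However, decoder-style models can be used for this task as well. Previously, we placed the instructions at the very beginning of the prompt. However, the very end of the prompt can also be a suitable location for instructions. Typically, it's better to place the instruction at one of the extreme ends.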

>>> torch.manual_seed(3)
>>> prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change.
... Write a summary of the above text.
... Summary:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=30,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"{seq['generated_text']}")
Permaculture is an ecological design mimicking natural ecosystems to meet basic needs and prepare for climate change. It is based on traditional knowledge and scientific understanding.

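Question answering

For the question answering task, we can structure the prompt into the following logical components: instructions, context, question, and a leading word or phrase ("Answer:") to nudge the model to start generating the answer: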

>>> torch.manual_seed(4)
>>> prompt = """Answer the question using the context below.
... Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
... Question: What modern tool is used to make gazpacho?
... Answer:
... """

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Modern tools are used, such as immersion blenders
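
The four components described above lend themselves to a small template function. The sketch below is a hypothetical helper (not part of the library) that assembles a prompt in the same layout as the example, so that context and question can be swapped without retyping the structure.

```python
def build_qa_prompt(context, question,
                    instruction="Answer the question using the context below."):
    # Instruction, context, question, then the leading word "Answer:".
    return (
        f"{instruction}\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:\n"
    )

prompt = build_qa_prompt(
    context="Gazpacho is a cold soup and drink made of raw, blended vegetables.",
    question="What is gazpacho made of?",
)
print(prompt)
```

Keeping the layout in one place also makes it easier to version and compare prompt variants later.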

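Reasoning

Reasoning is one of the most difficult tasks for LLMs, and achieving good results often requires applying advanced prompting techniques, like chain-of-thought.

Let's see if we can make a model reason about a simple arithmetic task with a basic prompt: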

>>> torch.manual_seed(5)
>>> prompt = """There are 5 groups of students in the class. Each group has 4 students. How many students are there in the class?"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=30,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: 
There are a total of 5 groups, so there are 5 x 4=20 students in the class.

Correct! Let's increase the complexity a little and see if we can still get away with a basic prompt:

>>> torch.manual_seed(6)
>>> prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. My partner then bought 6 more muffins and ate 2. How many muffins do we now have?"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=10,
...     do_sample=True,
...     top_k=10,
...     return_full_text = False,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: 
The total number of muffins now is 21

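This is a wrong answer: it should be 12 (15 - 2 - 5 + 6 - 2 = 12). In this case, this can be due to the prompt being too basic, or due to the choice of model; we picked the smallest version of Falcon, after all. Reasoning is difficult for models of all sizes, but larger models are likely to perform better.

Best practices of LLM prompting

In this section of the guide, we have compiled a list of best practices that tend to improve the prompt results:

When choosing the model to work with, the latest and most capable models are likely to perform better.
Start with a simple and short prompt, and iterate from there.
Put the instructions at the beginning of the prompt or at the very end. When working with large context, models apply various optimizations to prevent attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than to the middle.
Clearly separate instructions from the text they apply to (more on this in the next section).
Be specific and descriptive about the task and the desired outcome: its format, length, style, language, etc.
Avoid ambiguous descriptions and instructions.
Favor instructions that say "what to do" instead of those that say "what not to do".
"Lead" the output in the right direction by writing the first word (or even beginning the first sentence for the model).
Use advanced techniques like few-shot prompting and chain-of-thought.
Test your prompts with different models to assess their robustness.
Version and track the performance of your prompts.

Advanced prompting techniques

Few-shot prompting

The basic prompts in the sections above are examples of "zero-shot" prompts, meaning the model has been given instructions and context, but no examples with solutions. LLMs that have been fine-tuned on instruction datasets generally perform well on such "zero-shot" tasks. However, you may find that your task is more complex or nuanced, and perhaps you have some requirements for the output that the model doesn't pick up from the instructions alone. In this case, you can try the technique called few-shot prompting.

In few-shot prompting, we provide examples in the prompt, giving the model more context to improve performance. The examples condition the model to generate output following the patterns in the examples.

Here's an example: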

>>> torch.manual_seed(0)
>>> prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961.
... Date: 04/12/1961
... Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. 
... Date:"""

>>> sequences = pipe(
...     prompt,
...     max_new_tokens=8,
...     do_sample=True,
...     top_k=10,
... )

>>> for seq in sequences:
...     print(f"Result: {seq['generated_text']}")
Result: Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. 
Date: 09/28/1960

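In the above code snippet, we used a single example to demonstrate the desired output to the model, so this can be called "one-shot" prompting. However, depending on the task complexity, you may need to use more than one example.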
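
If you need more than one example, assembling the prompt from a list keeps the format consistent. The helper below is a hypothetical sketch that follows the Text/Date layout of the snippet above; the second example pair is made up for illustration.

```python
def build_text_few_shot_prompt(examples, query, input_label="Text", output_label="Date"):
    # examples: list of (input_text, expected_output) pairs.
    lines = []
    for text, answer in examples:
        lines.append(f"{input_label}: {text}")
        lines.append(f"{output_label}: {answer}")
    # The query repeats the pattern and leaves the output open.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)

examples = [
    ("The first human went into space and orbited the Earth on April 12, 1961.", "04/12/1961"),
    ("The meeting was held on March 3, 2020.", "03/03/2020"),  # made-up example
]
prompt = build_text_few_shot_prompt(
    examples,
    "The first-ever televised presidential debate in the United States took place on September 28, 1960.",
)
print(prompt)
```

With two pairs in the list this produces a two-shot prompt; the same function yields the one-shot prompt above when given a single pair.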

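Limitations of the few-shot prompting technique:

While LLMs can pick up on the patterns in the examples, these techniques don't work well on complex reasoning tasks.
Few-shot prompting requires creating lengthy prompts. Prompts with a large number of tokens can increase computation and latency. There is also a limit to the length of the prompts.
Sometimes, when given a number of examples, models can learn patterns that you didn't intend them to learn, e.g., that the third movie review is always negative.

Chain-of-thought

Chain-of-thought (CoT) prompting is a technique that nudges a model to produce intermediate reasoning steps, thus improving the results on complex reasoning tasks.

There are two ways of steering a model to produce the reasoning steps:

Few-shot prompting, by illustrating examples with detailed answers that show the model how to work through a problem.
Instructing the model to reason, by adding phrases like "Let's think step by step" or "Take a deep breath and work through the problem step by step".

If we apply the CoT technique to the muffins example from the reasoning section and use a larger model, such as tiiuae/falcon-180B-chat, which you can try in HuggingChat, we get a significant improvement in the reasoning result: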

Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
Therefore, you now have 12 muffins.
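
A zero-shot CoT prompt of the second kind can be built by appending the reasoning cue to the question. The sketch below shows one possible phrasing (the exact wording is an assumption) and also replays the arithmetic of the steps above as a sanity check on the expected answer.

```python
question = (
    "I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. "
    "My partner then bought 6 more muffins and ate 2. How many muffins do we now have?"
)

# Append a reasoning cue so the model writes out intermediate steps.
cot_prompt = f"{question}\nLet's think step by step."

# Replay the steps from the answer above to check the expected result.
muffins = 15
running_totals = []
for change in (-2, -5, +6, -2):
    muffins += change
    running_totals.append(muffins)

print(cot_prompt)
print(running_totals)
```

The running totals match the intermediate values in the model's step-by-step answer, which is a quick way to check a CoT response for arithmetic slips.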

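Prompting vs fine-tuning

You can achieve great results by optimizing your prompts; however, you may still wonder whether fine-tuning a model would work better for your case. Here are some scenarios when fine-tuning a smaller model may be a preferred option:

Your domain is wildly different from what LLMs were pretrained on, and extensive prompt optimization did not yield sufficient results.
You need your model to work well in a low-resource language.
You need the model to be trained on sensitive data that is under strict regulations.
You have to use a small model due to cost, privacy, infrastructure, or other limitations.

In all of the above examples, you will need to make sure that you either already have or can easily obtain a large enough domain-specific dataset at a reasonable cost to fine-tune a model. You will also need to have enough time and resources to fine-tune a model.

If the above examples are not the case for you, optimizing prompts can prove to be more beneficial.

Developer guides

Use tokenizers from 🤗 Tokenizers

Original: huggingface.co/docs/transformers/v4.37.2/en/fast_tokenizers

PreTrainedTokenizerFast depends on the 🤗 Tokenizers library. The tokenizers obtained from the 🤗 Tokenizers library can be loaded very simply into 🤗 Transformers.

Before getting into the specifics, let's first create a dummy tokenizer in a few lines: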

>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]
>>> tokenizer.train(files, trainer)

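We now have a tokenizer trained on the files we defined. We can either continue using it in that runtime, or save it to a JSON file for future reuse.

Loading directly from the tokenizer object

Let's see how to leverage this tokenizer object in the 🤗 Transformers library. The PreTrainedTokenizerFast class allows for easy instantiation by accepting the instantiated tokenizer object as an argument: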

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

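This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

Loading from a JSON file

In order to load a tokenizer from a JSON file, let's first save our tokenizer: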

>>> tokenizer.save("tokenizer.json")

The path to which we saved this file can be passed to the PreTrainedTokenizerFast initialization method using the tokenizer_file parameter:

>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

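This object can now be used with all the methods shared by the 🤗 Transformers tokenizers! Head to the tokenizer page for more information.

Multilingual models for inference

Original: huggingface.co/docs/transformers/v4.37.2/en/multilingual

There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not all multilingual model usage is different, though. Some models, like bert-base-multilingual-uncased, can be used just like a monolingual model. This guide shows you how to use multilingual models whose usage differs for inference.

XLM

XLM has ten different checkpoints, only one of which is monolingual. The nine remaining model checkpoints can be split into two categories: the checkpoints that use language embeddings and those that don't.

XLM with language embeddings

The following XLM models use language embeddings to specify the language used at inference:

xlm-mlm-ende-1024 (Masked language modeling, English-German)
xlm-mlm-enfr-1024 (Masked language modeling, English-French)
xlm-mlm-enro-1024 (Masked language modeling, English-Romanian)
xlm-mlm-xnli15-1024 (Masked language modeling, XNLI languages)
xlm-mlm-tlm-xnli15-1024 (Masked language modeling + translation, XNLI languages)
xlm-clm-enfr-1024 (Causal language modeling, English-French)
xlm-clm-ende-1024 (Causal language modeling, English-German)

Language embeddings are represented as a tensor of the same shape as the input_ids passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer's lang2id and id2lang attributes.

In this example, load the xlm-clm-enfr-1024 checkpoint (Causal language modeling, English-French):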

>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

The lang2id attribute of the tokenizer displays this model's languages and their IDs:

>>> print(tokenizer.lang2id)
{'en': 0, 'fr': 1}

Next, create an example input:

>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size of 1

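Set the language ID to "en" and use it to define the language embedding. The language embedding is a tensor filled with 0 since that is the language ID for English. This tensor should be the same size as input_ids.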

>>> language_id = tokenizer.lang2id["en"]  # 0
>>> langs = torch.tensor([language_id] * input_ids.shape[1])  # torch.tensor([0, 0, 0, ..., 0])

>>> # We reshape it to be of size (batch_size, sequence_length)
>>> langs = langs.view(1, -1)  # is now of shape [1, sequence_length] (we have a batch size of 1)

Now you can pass the input_ids and language embedding to the model:

>>> outputs = model(input_ids, langs=langs)

The run_generation.py script can generate text with language embeddings using the xlm-clm checkpoints.
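
The shape requirement generalizes to batches: every token position gets the ID of the language it belongs to. The sketch below uses plain nested lists as stand-ins for tensors so that it runs without any dependencies; the token IDs are made up and make_langs is a hypothetical helper. With PyTorch you would wrap the result in torch.tensor before passing it as langs.

```python
def make_langs(input_ids, language_id):
    # One language ID per token, mirroring the (batch_size, sequence_length) shape.
    return [[language_id] * len(row) for row in input_ids]

lang2id = {"en": 0, "fr": 1}  # same mapping as printed by tokenizer.lang2id above
input_ids = [[0, 11, 12, 13, 1], [0, 21, 22, 23, 1]]  # hypothetical token IDs

langs = make_langs(input_ids, lang2id["fr"])
print(langs)
```

Switching the key from "fr" to "en" yields the all-zero language embedding used in the example above.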

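XLM without language embeddings

The following XLM models do not require language embeddings during inference:

xlm-mlm-17-1280 (Masked language modeling, 17 languages)
xlm-mlm-100-1280 (Masked language modeling, 100 languages)

These models are used for generic sentence representations, unlike the previous XLM checkpoints.

BERT

The following BERT models can be used for multilingual tasks:

bert-base-multilingual-uncased (Masked language modeling + Next sentence prediction, 102 languages)
bert-base-multilingual-cased (Masked language modeling + Next sentence prediction, 104 languages)

These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly.

XLM-RoBERTa

The following XLM-RoBERTa models can be used for multilingual tasks:

xlm-roberta-base (Masked language modeling, 100 languages)
xlm-roberta-large (Masked language modeling, 100 languages)

XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering.

M2M100

The following M2M100 models can be used for multilingual translation:

facebook/m2m100_418M (Translation)
facebook/m2m100_1.2B (Translation)

In this example, load the facebook/m2m100_418M checkpoint to translate from Chinese to English. You can set the source language in the tokenizer: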

>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
>>> chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."

>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh")
>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

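Tokenize the text:

>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt")

M2M100 forces the target language ID as the first generated token to translate to the target language. Set the forced_bos_token_id to en in the generate method to translate to English: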

>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
'Do not interfere with the matters of the witches, because they are delicate and will soon be angry.'

MBart

以下 MBart 模型可用于多语言翻译：

facebook/mbart-large-50-one-to-many-mmt（一对多多语言机器翻译，50 种语言）
facebook/mbart-large-50-many-to-many-mmt（多对多多语言机器翻译，50 种语言）
facebook/mbart-large-50-many-to-one-mmt（多对一多语言机器翻译，50 种语言）
facebook/mbart-large-50（多语言翻译，50 种语言）
facebook/mbart-large-cc25

在此示例中，加载facebook/mbart-large-50-many-to-many-mmt检查点以将芬兰语翻译为英语。您可以在标记器中设置源语言：

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
>>> fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."

>>> tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

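Tokenize the Finnish text (the source language set in the tokenizer is fi_FI):

>>> encoded_fi = tokenizer(fi_text, return_tensors="pt")

MBart forces the target language ID as the first generated token in order to translate into the target language. In the generate method, set forced_bos_token_id to the English language ID (en_XX) to translate to English:

>>> generated_tokens = model.generate(**encoded_fi, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])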
>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
"Don't interfere with the wizard's affairs, because they are subtle, will soon get angry."

如果您正在使用facebook/mbart-large-50-many-to-one-mmt检查点，则不需要强制目标语言 ID 作为第一个生成的标记，否则用法相同。

创建自定义架构

原文链接: huggingface.co/docs/transformers/v4.37.2/en/create_a_model

AutoClass会自动推断模型架构并下载预训练配置和权重。通常，我们建议使用AutoClass生成与检查点无关的代码。但是，希望对特定模型参数有更多控制的用户可以从几个基类创建一个自定义🤗 Transformers 模型。这对于任何对研究、训练或实验🤗 Transformers 模型感兴趣的人特别有用。在本指南中，深入了解如何创建一个自定义模型而不使用AutoClass。学习如何：

加载并自定义模型配置。
创建模型架构。
为文本创建慢速和快速分词器。
为视觉任务创建图像处理器。
为音频任务创建特征提取器。
为多模态任务创建处理器。

配置

configuration 指的是模型的特定属性。每个模型配置都有不同的属性；例如，所有 NLP 模型都共有hidden_size、num_attention_heads、num_hidden_layers和vocab_size属性。这些属性指定了构建模型所需的注意力头或隐藏层的数量。

通过访问 DistilBertConfig 来查看其属性，进一步了解 DistilBERT：

>>> from transformers import DistilBertConfig

>>> config = DistilBertConfig()
>>> print(config)
DistilBertConfig {
  "activation": "gelu",
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.16.2",
  "vocab_size": 30522
}

DistilBertConfig 显示了用于构建基本 DistilBertModel 的所有默认属性。所有属性都是可定制的，为实验创造了空间。例如，您可以自定义一个默认模型：

使用activation参数尝试不同的激活函数。
使用attention_dropout参数为注意力概率设置更高的 dropout 比率。

>>> my_config = DistilBertConfig(activation="relu", attention_dropout=0.4)
>>> print(my_config)
DistilBertConfig {
  "activation": "relu",
  "attention_dropout": 0.4,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "transformers_version": "4.16.2",
  "vocab_size": 30522
}

预训练模型属性可以在 from_pretrained()函数中修改：

>>> my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4)

一旦您满意您的模型配置，您可以用 save_pretrained()保存它。您的配置文件将以 JSON 文件的形式存储在指定的保存目录中：

>>> my_config.save_pretrained(save_directory="./your_model_save_path")

要重用配置文件，请使用 from_pretrained()加载它：

>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json")

您还可以将配置文件保存为字典，甚至只保存自定义配置属性与默认配置属性之间的差异！查看 configuration 文档以获取更多详细信息。
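As a hedged illustration of the dictionary and "difference only" views mentioned above, the sketch below uses `to_dict()`, `to_diff_dict()` and `to_json_string(use_diff=True)` from the configuration base class. It needs no download because the configuration is built locally. Exactly which keys appear in the reduced dictionary can vary between library versions, so treat the printed output as indicative only.

```python
from transformers import DistilBertConfig

# Same customized configuration as in the example above.
my_config = DistilBertConfig(activation="relu", attention_dropout=0.4)

# Full view: every attribute of the configuration as a plain Python dict.
full = my_config.to_dict()

# Reduced view: values that match the library defaults are left out.
diff = my_config.to_diff_dict()

print(full["activation"], full["attention_dropout"])
print(diff["activation"], diff["attention_dropout"])

# JSON string of the reduced view, similar to what is written to config.json.
print(my_config.to_json_string(use_diff=True))
```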

模型

下一步是创建一个 model。模型 - 也宽泛地称为架构 - 定义了每个层正在做什么以及正在发生的操作。像num_hidden_layers这样的配置属性用于定义架构。每个模型都共享基类 PreTrainedModel 和一些常见方法，如调整输入嵌入和修剪自注意力头。此外，所有模型也是torch.nn.Module、tf.keras.Model或flax.linen.Module的子类。这意味着模型与各自框架的使用是兼容的。

PyTorch

将您的自定义配置属性加载到模型中：

>>> from transformers import DistilBertModel

>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json")
>>> model = DistilBertModel(my_config)

这将创建一个具有随机值而不是预训练权重的模型。在训练之前，您无法将此模型用于任何有用的目的。训练是一个昂贵且耗时的过程。通常最好使用预训练模型以更快地获得更好的结果，同时仅使用训练所需资源的一小部分。

使用 from_pretrained()创建一个预训练模型：

>>> model = DistilBertModel.from_pretrained("distilbert-base-uncased")

当加载预训练权重时，如果模型由🤗 Transformers 提供，则默认模型配置会自动加载。但是，如果您愿意，仍然可以替换 - 一些或全部 - 默认模型配置属性为您自己的属性：

>>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config)

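TensorFlow

Load your custom configuration attributes into the model (save_pretrained() writes the file as config.json, so that is the path to load):

>>> from transformers import TFDistilBertModel

>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json")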
>>> tf_model = TFDistilBertModel(my_config)

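Create a pretrained model with from_pretrained():

>>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

When you load pretrained weights, the default model configuration is loaded automatically if the model is provided by 🤗 Transformers. However, you can still replace some or all of the default model configuration attributes with your own if you'd like:

>>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config)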

模型头

此时，您有一个输出隐藏状态的基础 DistilBERT 模型。隐藏状态作为输入传递给模型头，以产生最终输出。只要模型支持任务（即，您不能将 DistilBERT 用于像翻译这样的序列到序列任务），🤗 Transformers 为每个任务提供不同的模型头。

PyTorch

例如，DistilBertForSequenceClassification 是一个带有序列分类头的基础 DistilBERT 模型。序列分类头是在汇总输出之上的线性层。

>>> from transformers import DistilBertForSequenceClassification

>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

通过切换到不同的模型头轻松地为另一个任务重用此检查点。对于问答任务，您将使用 DistilBertForQuestionAnswering 模型头。问答头类似于序列分类头，只是它是在隐藏状态输出之上的线性层。

>>> from transformers import DistilBertForQuestionAnswering

>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

TensorFlow

例如，TFDistilBertForSequenceClassification 是一个带有序列分类头的基础 DistilBERT 模型。序列分类头是在汇总输出之上的线性层。

>>> from transformers import TFDistilBertForSequenceClassification

>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

通过切换到不同的模型头轻松地为另一个任务重用此检查点。对于问答任务，您将使用 TFDistilBertForQuestionAnswering 模型头。问答头类似于序列分类头，只是它是在隐藏状态输出之上的线性层。

>>> from transformers import TFDistilBertForQuestionAnswering

>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Tokenizer

The last base class you need before using a model on text data is a tokenizer, which converts raw text to tensors. There are two types of tokenizers you can use with 🤗 Transformers:

PreTrainedTokenizer：一个 tokenizer 的 Python 实现。
PreTrainedTokenizerFast：来自我们基于 Rust 的🤗 Tokenizer库的一个分词器。这种分词器类型速度明显更快 - 特别是在批量分词时 - 这是由于其 Rust 实现。快速分词器还提供了额外的方法，比如偏移映射，将标记映射到它们的原始单词或字符。

两种分词器都支持常见方法，如编码和解码、添加新标记和管理特殊标记。

并非每个模型都支持快速分词器。查看这个 table 以检查模型是否支持快速分词器。

如果您训练了自己的分词器，可以从您的vocabulary文件创建一个：

>>> from transformers import DistilBertTokenizer

>>> my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left")

重要的是要记住，自定义分词器的词汇表与预训练模型的分词器生成的词汇表是不同的。如果您使用预训练模型，则需要使用预训练模型的词汇表，否则输入将没有意义。使用 DistilBertTokenizer 类创建一个带有预训练模型词汇表的分词器：

>>> from transformers import DistilBertTokenizer

>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

使用 DistilBertTokenizerFast 类创建一个快速分词器：

>>> from transformers import DistilBertTokenizerFast

>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

默认情况下，AutoTokenizer 将尝试加载一个快速分词器。您可以通过在from_pretrained中设置use_fast=False来禁用此行为。

图像处理器

图像处理器处理视觉输入。它继承自基础 ImageProcessingMixin 类。

要使用，创建一个与您正在使用的模型相关联的图像处理器。例如，如果您正在使用 ViT 进行图像分类，则创建一个默认的 ViTImageProcessor：

>>> from transformers import ViTImageProcessor

>>> vit_extractor = ViTImageProcessor()
>>> print(vit_extractor)
ViTImageProcessor {
  "do_normalize": true,
  "do_resize": true,
  "image_processor_type": "ViTImageProcessor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}

如果您不需要任何自定义，只需使用from_pretrained方法加载模型的默认图像处理器参数。

修改任何 ViTImageProcessor 参数以创建您自定义的图像处理器：

>>> from transformers import ViTImageProcessor

>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
>>> print(my_vit_extractor)
ViTImageProcessor {
  "do_normalize": false,
  "do_resize": true,
  "image_processor_type": "ViTImageProcessor",
  "image_mean": [
    0.3,
    0.3,
    0.3
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": "PIL.Image.BOX",
  "size": 224
}

特征提取器

特征提取器处理音频输入。它继承自基础 FeatureExtractionMixin 类，并且还可以继承自 SequenceFeatureExtractor 类来处理音频输入。

要使用，创建一个与您正在使用的模型相关联的特征提取器。例如，如果您正在使用 Wav2Vec2 进行音频分类，则创建一个默认的 Wav2Vec2FeatureExtractor：

>>> from transformers import Wav2Vec2FeatureExtractor

>>> w2v2_extractor = Wav2Vec2FeatureExtractor()
>>> print(w2v2_extractor)
Wav2Vec2FeatureExtractor {
  "do_normalize": true,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 16000
}

如果您不需要任何自定义，只需使用from_pretrained方法加载模型的默认特征提取器参数。

修改任何 Wav2Vec2FeatureExtractor 参数以创建您自定义的特征提取器：

>>> from transformers import Wav2Vec2FeatureExtractor

>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False)
>>> print(w2v2_extractor)
Wav2Vec2FeatureExtractor {
  "do_normalize": false,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": false,
  "sampling_rate": 8000
}

处理器

对于支持多模态任务的模型，🤗 Transformers 提供了一个处理器类，方便地将处理类（如特征提取器和分词器）包装成一个单一对象。例如，让我们使用 Wav2Vec2Processor 来进行自动语音识别任务（ASR）。ASR 将音频转录为文本，因此您将需要一个特征提取器和一个分词器。

创建一个特征提取器来处理音频输入：

>>> from transformers import Wav2Vec2FeatureExtractor

>>> feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True)

创建一个分词器来处理文本输入：

>>> from transformers import Wav2Vec2CTCTokenizer

>>> tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt")

将特征提取器和分词器组合在 Wav2Vec2Processor 中：

>>> from transformers import Wav2Vec2Processor

>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

通过两个基本类 - 配置和模型 - 以及额外的预处理类（分词器、图像处理器、特征提取器或处理器），您可以创建🤗 Transformers 支持的任何模型。这些基本类都是可配置的，允许您使用您想要的特定属性。您可以轻松设置一个用于训练的模型或修改现有的预训练模型进行微调。

构建自定义模型

原文链接：huggingface.co/docs/transformers/v4.37.2/en/custom_models

🤗 Transformers 库被设计为易于扩展。每个模型都完全在存储库的给定子文件夹中编码，没有抽象，因此您可以轻松复制一个建模文件并根据需要进行调整。

如果您正在编写全新的模型，从头开始可能更容易。在本教程中，我们将向您展示如何编写自定义模型及其配置，以便可以在 Transformers 中使用，并且您可以与社区共享（以及它所依赖的代码），以便任何人都可以使用它，即使它不在🤗 Transformers 库中。我们将看到如何在 transformers 上构建并扩展框架，使用您的钩子和自定义代码。

我们将在 ResNet 模型上说明所有这些，通过将timm 库的 ResNet 类包装到 PreTrainedModel 中。

编写自定义配置

在深入研究模型之前，让我们先编写其配置。模型的配置是一个对象，其中包含构建模型所需的所有信息。正如我们将在下一节中看到的，模型只能接受一个config进行初始化，因此我们确实需要该对象尽可能完整。

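In our example, we take a few arguments of the ResNet class that we might want to tweak. Different configurations then give us the different types of ResNets that are possible. We simply store those arguments, after checking the validity of a few of them.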

from transformers import PretrainedConfig
from typing import List

class ResnetConfig(PretrainedConfig):
    model_type = "resnet"

    def __init__(
        self,
        block_type="bottleneck",
        layers: List[int] = [3, 4, 6, 3],
        num_classes: int = 1000,
        input_channels: int = 3,
        cardinality: int = 1,
        base_width: int = 64,
        stem_width: int = 64,
        stem_type: str = "",
        avg_down: bool = False,
        **kwargs,
    ):
        # Validate the arguments that only accept a fixed set of values.
        if block_type not in ["basic", "bottleneck"]:
            raise ValueError(f"`block_type` must be 'basic' or 'bottleneck', got {block_type}.")
        if stem_type not in ["", "deep", "deep-tiered"]:
            raise ValueError(f"`stem_type` must be '', 'deep' or 'deep-tiered', got {stem_type}.")

        self.block_type = block_type
        self.layers = layers
        self.num_classes = num_classes
        self.input_channels = input_channels
        self.cardinality = cardinality
        self.base_width = base_width
        self.stem_width = stem_width
        self.stem_type = stem_type
        self.avg_down = avg_down
        super().__init__(**kwargs)

编写自定义配置时需要记住的三个重要事项如下：

您必须继承自PretrainedConfig，
您的PretrainedConfig的__init__必须接受任何 kwargs，
这些kwargs需要传递给超类__init__。

继承是为了确保您从🤗 Transformers 库中获得所有功能，而另外两个约束来自于PretrainedConfig拥有比您设置的字段更多。当使用from_pretrained方法重新加载配置时，这些字段需要被您的配置接受，然后发送到超类。

为您的配置定义model_type（这里model_type="resnet"）不是强制性的，除非您希望将您的模型注册到自动类（请参见最后一节）。

完成后，您可以像处理库中任何其他模型配置一样轻松创建和保存您的配置。以下是我们如何创建一个 resnet50d 配置并保存它：

resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True)
resnet50d_config.save_pretrained("custom-resnet")

这将在custom-resnet文件夹内保存一个名为config.json的文件。然后，您可以使用from_pretrained方法重新加载您的配置：

resnet50d_config = ResnetConfig.from_pretrained("custom-resnet")

您还可以使用 PretrainedConfig 类的任何其他方法，例如 push_to_hub()直接将您的配置上传到 Hub。

编写自定义模型

现在我们有了 ResNet 配置，我们可以继续编写模型。实际上，我们将编写两个模型：一个从一批图像中提取隐藏特征的模型（类似于 BertModel），一个适用于图像分类的模型（类似于 BertForSequenceClassification）。

如前所述，我们将只编写模型的松散包装，以使示例简单化。在编写此类之前，我们需要做的唯一事情是将块类型与实际块类之间建立映射。然后通过将所有内容传递给ResNet类，从配置中定义模型：

from transformers import PreTrainedModel
from timm.models.resnet import BasicBlock, Bottleneck, ResNet
from .configuration_resnet import ResnetConfig

BLOCK_MAPPING = {"basic": BasicBlock, "bottleneck": Bottleneck}

class ResnetModel(PreTrainedModel):
    config_class = ResnetConfig

    def __init__(self, config):
        super().__init__(config)
        block_layer = BLOCK_MAPPING[config.block_type]
        self.model = ResNet(
            block_layer,
            config.layers,
            num_classes=config.num_classes,
            in_chans=config.input_channels,
            cardinality=config.cardinality,
            base_width=config.base_width,
            stem_width=config.stem_width,
            stem_type=config.stem_type,
            avg_down=config.avg_down,
        )

    def forward(self, tensor):
        return self.model.forward_features(tensor)

对于将对图像进行分类的模型，我们只需更改前向方法：

import torch

class ResnetModelForImageClassification(PreTrainedModel):
    config_class = ResnetConfig

    def __init__(self, config):
        super().__init__(config)
        block_layer = BLOCK_MAPPING[config.block_type]
        self.model = ResNet(
            block_layer,
            config.layers,
            num_classes=config.num_classes,
            in_chans=config.input_channels,
            cardinality=config.cardinality,
            base_width=config.base_width,
            stem_width=config.stem_width,
            stem_type=config.stem_type,
            avg_down=config.avg_down,
        )

    def forward(self, tensor, labels=None):
        logits = self.model(tensor)
        if labels is not None:
            loss = torch.nn.functional.cross_entropy(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}

在这两种情况下，请注意我们如何从PreTrainedModel继承并使用config调用超类初始化（有点像当您编写常规torch.nn.Module时）。设置config_class的行不是强制性的，除非您想将您的模型注册到自动类（请参见最后一节）。

如果您的模型与库中的模型非常相似，您可以重用与该模型相同的配置。

您可以让您的模型返回任何您想要的内容，但是像我们为ResnetModelForImageClassification所做的那样返回一个包含损失的字典，当传递标签时，将使您的模型可以直接在 Trainer 类中使用。只要您打算使用自己的训练循环或其他库进行训练，使用另一种输出格式也是可以的。

现在我们有了我们的模型类，让我们创建一个：

resnet50d = ResnetModelForImageClassification(resnet50d_config)

同样，您可以使用 PreTrainedModel 的任何方法，例如 save_pretrained()或 push_to_hub()。我们将在下一节中使用第二种方法，并看看如何将模型权重与我们模型的代码一起推送。但首先，让我们在模型中加载一些预训练权重。

在您自己的用例中，您可能会在自己的数据上训练自定义模型。为了在本教程中快速进行，我们将使用 resnet50d 的预训练版本。由于我们的模型只是它的一个包装器，所以转移这些权重将会很容易：

import timm

pretrained_model = timm.create_model("resnet50d", pretrained=True)
resnet50d.model.load_state_dict(pretrained_model.state_dict())

现在让我们看看如何确保在执行 save_pretrained()或 push_to_hub()时，模型的代码被保存。

将带有自定义代码的模型注册到自动类

如果您正在编写一个扩展🤗 Transformers 的库，您可能希望扩展自动类以包括您自己的模型。这与将代码推送到 Hub 不同，用户需要导入您的库才能获取自定义模型（与自动从 Hub 下载模型代码相反）。

只要您的配置具有与现有模型类型不同的model_type属性，并且您的模型类具有正确的config_class属性，您就可以像这样将它们添加到自动类中：

from transformers import AutoConfig, AutoModel, AutoModelForImageClassification

AutoConfig.register("resnet", ResnetConfig)
AutoModel.register(ResnetConfig, ResnetModel)
AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification)

注意，在将自定义配置注册到 AutoConfig 时使用的第一个参数需要与自定义配置的model_type匹配，并且在将自定义模型注册到任何自动模型类时使用的第一个参数需要与这些模型的config_class匹配。
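The matching rule described above can be tried without any model weights. The sketch below is only an illustration: `ToyConfig`, its `width` field and the model type string `"toy-sketch-model"` are hypothetical names invented for this example. It registers the configuration and then builds it through `AutoConfig.for_model`.

```python
from transformers import AutoConfig, PretrainedConfig


class ToyConfig(PretrainedConfig):
    # Hypothetical model type; it must not collide with an existing one.
    model_type = "toy-sketch-model"

    def __init__(self, width: int = 8, **kwargs):
        self.width = width
        # Extra kwargs are forwarded to the superclass, as required for custom configs.
        super().__init__(**kwargs)


# The first argument must be identical to ToyConfig.model_type.
AutoConfig.register("toy-sketch-model", ToyConfig)

# The auto class can now build the custom configuration from its model type.
config = AutoConfig.for_model("toy-sketch-model", width=16)
print(type(config).__name__, config.width)

# Registering under a name that differs from model_type is expected to be rejected.
try:
    AutoConfig.register("some-other-name", ToyConfig)
except ValueError as err:
    print("rejected:", err)
```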

将代码发送到 Hub

此 API 是实验性的，可能在下一个版本中有一些轻微的破坏性更改。

首先，请确保您的模型在一个.py文件中完全定义。它可以依赖于一些其他文件的相对导入，只要所有文件都在同一个目录中（我们目前还不支持此功能的子模块）。对于我们的示例，我们将在当前工作目录中的一个名为resnet_model的文件夹中定义一个modeling_resnet.py文件和一个configuration_resnet.py文件。配置文件包含ResnetConfig的代码，建模文件包含ResnetModel和ResnetModelForImageClassification的代码。

.
└── resnet_model
    ├── __init__.py
    ├── configuration_resnet.py
    └── modeling_resnet.py

__init__.py可以为空，只是为了让 Python 检测resnet_model可以用作模块。

如果从库中复制建模文件，则需要将文件顶部的所有相对导入替换为从transformers包导入。

请注意，您可以重复使用（或子类化）现有的配置/模型。

要与社区共享您的模型，请按照以下步骤操作：首先从新创建的文件中导入 ResNet 模型和配置：

from resnet_model.configuration_resnet import ResnetConfig
from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification

然后，当使用save_pretrained方法时，您必须告诉库您要复制这些对象的代码文件，并使用给定的 Auto 类正确注册它们（特别是对于模型），只需运行：

ResnetConfig.register_for_auto_class()
ResnetModel.register_for_auto_class("AutoModel")
ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClassification")

请注意，无需为配置指定自动类（它们只有一个自动类，AutoConfig），但对于模型来说情况不同。您的自定义模型可能适用于许多不同的任务，因此您必须指定哪个自动类是您模型的正确类。

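Use register_for_auto_class() if you want the code files to be copied. If you prefer to use code on the Hub from another repository, you don't need to call it. When there is more than one auto class, you can edit config.json directly using the following structure:

"auto_map": {
    "AutoConfig": "<your-repo-name>--<config-name>",
    "AutoModel": "<your-repo-name>--<config-name>",
    "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
},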

接下来，让我们像以前一样创建配置和模型：

resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True)
resnet50d = ResnetModelForImageClassification(resnet50d_config)

pretrained_model = timm.create_model("resnet50d", pretrained=True)
resnet50d.model.load_state_dict(pretrained_model.state_dict())

现在要将模型发送到 Hub，请确保您已登录。在终端中运行：

huggingface-cli login

或者从笔记本中：

from huggingface_hub import notebook_login

notebook_login()

然后，您可以像这样推送到您自己的命名空间（或您是其成员的组织）：

resnet50d.push_to_hub("custom-resnet50d")

除了以 json 格式复制建模权重和配置外，还将建模和配置的.py文件复制到custom-resnet50d文件夹中，并将结果上传到 Hub。您可以在此model repo中查看结果。

有关将模型推送到 Hub 的方法的更多信息，请参阅共享教程。

使用具有自定义代码的模型

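You can use any configuration, model, or tokenizer that has custom code files in its repository through the auto classes and the from_pretrained method. All files and code uploaded to the Hub are scanned for malware (see the Hub security documentation for more information), but you should still review the model code and its author to avoid executing malicious code on your machine. Set trust_remote_code=True to use a model with custom code: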

from transformers import AutoModelForImageClassification

model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True)

强烈建议传递提交哈希作为revision，以确保模型的作者没有使用一些恶意的新代码更新代码（除非您完全信任模型的作者）。

commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292"
model = AutoModelForImageClassification.from_pretrained(
    "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash
)

请注意，在 Hub 上浏览模型存储库的提交历史时，有一个按钮可以轻松复制任何提交的提交哈希。

聊天模型的模板

原始文本：huggingface.co/docs/transformers/v4.37.2/en/chat_templating

介绍

越来越常见的 LLMs 的用例是聊天。在聊天环境中，模型不是继续单个文本字符串（这是标准语言模型的情况），而是继续由一个或多个消息组成的对话，每个消息包括一个角色，如“用户”或“助手”，以及消息文本。

与标记化类似，不同的模型对于聊天期望非常不同的输入格式。这就是我们将聊天模板作为一个特性添加的原因。聊天模板是分词器的一部分。它们指定如何将表示为消息列表的对话转换为模型期望的单个可标记化字符串的格式。

让我们通过使用 BlenderBot 模型的一个快速示例来具体化这一点。BlenderBot 有一个非常简单的默认模板，主要是在对话轮之间添加空格：

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> chat = [
...    {"role": "user", "content": "Hello, how are you?"},
...    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
...    {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]

>>> tokenizer.apply_chat_template(chat, tokenize=False)
" Hello, how are you?  I'm doing great. How can I help you today?   I'd like to show off how chat templating works!</s>"

请注意整个聊天被压缩成一个字符串。如果我们使用 tokenize=True，这是默认设置，那么该字符串也将被标记化。然而，为了看到一个更复杂的模板在操作中的效果，让我们使用 mistralai/Mistral-7B-Instruct-v0.1 模型。

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

>>> chat = [
...   {"role": "user", "content": "Hello, how are you?"},
...   {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
...   {"role": "user", "content": "I'd like to show off how chat templating works!"},
... ]

>>> tokenizer.apply_chat_template(chat, tokenize=False)
"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

请注意，这次分词器已经添加了控制标记 [INST] 和 [/INST] 来指示用户消息的开始和结束（但不包括助手消息！）。Mistral-instruct 是使用这些标记进行训练的，但 BlenderBot 没有。

如何使用聊天模板？

正如您在上面的示例中所看到的，聊天模板很容易使用。只需构建一个带有 role 和 content 键的消息列表，然后将其传递给 apply_chat_template() 方法。一旦您这样做了，您将得到准备好的输出！当将聊天模板用作模型生成的输入时，使用 add_generation_prompt=True 添加一个生成提示也是一个好主意。

这是准备输入给 model.generate() 的示例，使用 Zephyr 助手模型：

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

这将产生一个符合 Zephyr 期望的输入格式的字符串。

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>

现在我们的输入已经正确格式化为 Zephyr，我们可以使用模型为用户的问题生成响应。

outputs = model.generate(tokenized_chat, max_new_tokens=128) 
print(tokenizer.decode(outputs[0]))

这将产生：

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>
Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.

啊，原来如此简单！

是否有用于聊天的自动化管道？

是的，有：ConversationalPipeline。这个管道旨在使使用聊天模型变得容易。让我们再次尝试 Zephyr 示例，但这次使用管道：

from transformers import pipeline

pipe = pipeline("conversational", "HuggingFaceH4/zephyr-7b-beta")
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(pipe(messages))

Conversation id: 76d886a0-74bd-454e-9804-0467041a63dc
system: You are a friendly chatbot who always responds in the style of a pirate
user: How many helicopters can a human eat in one sitting?
assistant: Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all.

ConversationalPipeline 将负责所有的标记化细节，并为您调用 apply_chat_template - 一旦模型有了聊天模板，您所需要做的就是初始化管道并将消息列表传递给它！

“生成提示”是什么？

您可能已经注意到 apply_chat_template 方法有一个 add_generation_prompt 参数。这个参数告诉模板添加指示机器人响应开始的标记。例如，考虑以下聊天：

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]

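Here is what this looks like without a generation prompt, using a ChatML-style template (messages wrapped in <|im_start|> and <|im_end|> tokens):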

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

这是带有生成提示的样子：

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""

请注意，这一次，我们添加了指示机器人响应开始的标记。这确保了当模型生成文本时，它将写入一个机器人响应，而不是做一些意外的事情，比如继续用户的消息。请记住，聊天模型仍然只是语言模型 - 它们被训练来继续文本，而聊天只是对它们来说的一种特殊文本！您需要使用适当的控制标记来指导它们知道应该做什么。

并非所有模型都需要生成提示。一些模型，如 BlenderBot 和 LLaMA，在机器人响应之前没有任何特殊标记。在这些情况下，add_generation_prompt参数将不起作用。add_generation_prompt的确切效果将取决于所使用的模板。

我可以在训练中使用聊天模板吗？

是的！我们建议您将聊天模板应用为数据集的预处理步骤。之后，您可以像处理任何其他语言模型训练任务一样继续。在训练时，通常应设置add_generation_prompt=False，因为在训练过程中，添加的提示助手响应的标记将不会有帮助。让我们看一个例子：

from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

chat1 = [
    {"role": "user", "content": "Which is bigger, the moon or the sun?"},
    {"role": "assistant", "content": "The sun."}
]
chat2 = [
    {"role": "user", "content": "Which is bigger, a virus or a bacterium?"},
    {"role": "assistant", "content": "A bacterium."}
]

dataset = Dataset.from_dict({"chat": [chat1, chat2]})
dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)})
print(dataset['formatted_chat'][0])

然后我们得到：

<|user|>
Which is bigger, the moon or the sun?</s>
<|assistant|>
The sun.</s>

从这里开始，就像处理标准语言建模任务一样继续训练，使用formatted_chat列。

高级：聊天模板如何工作？

模型的聊天模板存储在tokenizer.chat_template属性中。如果没有设置聊天模板，则将使用该模型类的默认模板。让我们看一下BlenderBot的模板：


>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

>>> tokenizer.default_chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"

这有点令人生畏。让我们添加一些换行和缩进，使其更易读。请注意，每个块后的第一个换行以及块之前的任何前导空格默认情况下会被忽略，使用 Jinja 的trim_blocks和lstrip_blocks标志。但是，请谨慎 - 尽管每行的前导空格被剥离，但同一行上块之间的空格不会被剥离。我们强烈建议检查您的模板是否在不应该的地方打印额外的空格！

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ ' ' }}
    {% endif %}
    {{ message['content'] }}
    {% if not loop.last %}
        {{ '  ' }}
    {% endif %}
{% endfor %}
{{ eos_token }}

如果您以前从未见过这种模板，这是一个Jinja 模板。Jinja 是一种模板语言，允许您编写生成文本的简单代码。在许多方面，代码和语法类似于 Python。在纯 Python 中，这个模板看起来会像这样：

for idx, message in enumerate(messages):
    if message['role'] == 'user':
        print(' ')
    print(message['content'])
    if not idx == len(messages) - 1:  # Check for the last message in the conversation
        print('  ')
print(eos_token)

实际上，模板执行三件事：

对于每条消息，如果消息是用户消息，则在其前添加一个空格，否则不打印任何内容。
添加消息内容
如果消息不是最后一条消息，请在其后添加两个空格。在最后一条消息之后，打印 EOS 标记。
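The three steps above can be sketched as a small pure-Python function. This is an illustrative re-implementation, not the library's code: the function name `render_blenderbot_style` is hypothetical, and it builds a single string instead of calling `print`, because `print` would add a newline after each piece.

```python
def render_blenderbot_style(messages, eos_token="</s>"):
    """Format a conversation the way the BlenderBot-style template above does."""
    parts = []
    for idx, message in enumerate(messages):
        # Step 1: user messages are preceded by a single space.
        if message["role"] == "user":
            parts.append(" ")
        # Step 2: add the message content itself.
        parts.append(message["content"])
        # Step 3: every message except the last is followed by two spaces.
        if idx != len(messages) - 1:
            parts.append("  ")
    # After the last message, append the EOS token.
    parts.append(eos_token)
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

rendered = render_blenderbot_style(chat)
print(repr(rendered))
```

With the same three-message chat used at the start of this guide, the result is the same string that `apply_chat_template` returned for BlenderBot there.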

这是一个非常简单的模板 - 它不添加任何控制标记，也不支持“系统”消息，这是一种常见的方式，用于向模型提供关于其在随后对话中应该如何行为的指令。但是 Jinja 为您提供了很大的灵活性来执行这些操作！让我们看一个 Jinja 模板，可以类似于 LLaMA 格式化输入（请注意，真正的 LLaMA 模板包括处理默认系统消息以及一般情况下稍有不同的系统消息处理 - 不要在实际代码中使用这个！）

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
    {% elif message['role'] == 'system' %}
        {{ '<<SYS>>\\n' + message['content'] + '\\n<</SYS>>\\n\\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ ' '  + message['content'] + ' ' + eos_token }}
    {% endif %}
{% endfor %}

希望如果您仔细看一下，您就能看出这个模板在做什么 - 它根据每条消息的“角色”添加特定的标记，这些标记代表发送者是谁。用户、助手和系统消息因为它们被包裹在其中的标记而清晰可辨。
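For readers more comfortable with Python than Jinja, the role-based branching of the simplified LLaMA-like template can be sketched as follows. The function name `render_llama_like` and the default token strings `<s>` and `</s>` are assumptions made for this illustration. The sketch also assumes the doubled backslashes in the template stand for real newline characters. Like the template it mirrors, it is not the real LLaMA format and should not be used in production code.

```python
def render_llama_like(messages, bos_token="<s>", eos_token="</s>"):
    """Mirror the simplified LLaMA-like template: each role gets its own wrapper."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            # User turns are wrapped in [INST] ... [/INST] and start with BOS.
            parts.append(bos_token + "[INST] " + content + " [/INST]")
        elif role == "system":
            # System messages are wrapped in <<SYS>> ... <</SYS>> markers.
            parts.append("<<SYS>>\n" + content + "\n<</SYS>>\n\n")
        elif role == "assistant":
            # Assistant turns are surrounded by spaces and closed with EOS.
            parts.append(" " + content + " " + eos_token)
        # Any other role is silently skipped, as in the template.
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
formatted = render_llama_like(conversation)
print(repr(formatted))
```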

高级：添加和编辑聊天模板

如何创建聊天模板？

简单，只需编写一个 Jinja 模板并设置tokenizer.chat_template。您可能会发现，从另一个模型的现有模板开始，并为您的需求简单编辑它会更容易！例如，我们可以采用上面的 LLaMA 模板，并为助手消息添加"[ASST]"和"[/ASST]"：

{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}
    {% elif message['role'] == 'system' %}
        {{ '<<SYS>>\\n' + message['content'].strip() + '\\n<</SYS>>\\n\\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ '[ASST] '  + message['content'] + ' [/ASST]' + eos_token }}
    {% endif %}
{% endfor %}

现在，只需设置tokenizer.chat_template属性。下次使用 apply_chat_template()，它将使用您的新模板！此属性将保存在tokenizer_config.json文件中，因此您可以使用 push_to_hub()将您的新模板上传到 Hub，并确保每个人都在使用正确的模板来使用您的模型！

template = tokenizer.chat_template
template = template.replace("SYS", "SYSTEM")  # Change the system token
tokenizer.chat_template = template  # Set the new template
tokenizer.push_to_hub("model_name")  # Upload your new template to the Hub!

使用您的聊天模板的方法 apply_chat_template()由 ConversationalPipeline 类调用，因此一旦您设置了正确的聊天模板，您的模型将自动与 ConversationalPipeline 兼容。

“默认”模板是什么？

在引入聊天模板之前，聊天处理是在模型类级别上硬编码的。为了向后兼容，我们保留了这种特定类处理作为默认模板，也在类级别上设置了。如果一个模型没有设置聊天模板，但是它的模型类有一个默认模板，ConversationalPipeline类和apply_chat_template等方法将使用类模板。您可以通过检查tokenizer.default_chat_template属性来查找您的分词器的默认模板。

这是我们纯粹为了向后兼容性的原因而做的事情，以避免破坏任何现有的工作流程。即使类模板适用于您的模型，我们强烈建议通过将chat_template属性显式设置来覆盖默认模板，以便向用户明确表明您的模型已正确配置为聊天，并为将来防范默认模板被修改或弃用的情况做好准备。

我应该使用哪个模板？

当为已经训练过的聊天模型设置模板时，您应该确保模板与模型在训练过程中看到的消息格式完全匹配，否则您可能会遇到性能下降。即使您继续训练模型，也是如此 - 如果保持聊天标记不变，您可能会获得最佳性能。这与标记化非常类似 - 在推理或微调时，当您精确匹配训练过程中使用的标记化时，通常会获得最佳性能。

如果您从头开始训练模型，或者在另一方面微调基础语言模型以用于聊天，您有很大的自由选择适当的模板！LLMs 足够聪明，可以学会处理许多不同的输入格式。我们为没有特定类别模板的模型提供的默认模板遵循 ChatML 格式，对于许多用例来说，这是一个很好的、灵活的选择。它看起来像这样：

{% for message in messages %}
    {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}
{% endfor %}

如果您喜欢这个，这里有一个一行代码形式的版本，可以直接复制到您的代码中。这个一行代码还包括对生成提示的方便支持，但请注意它不会添加 BOS 或 EOS 标记！如果您的模型需要这些标记，apply_chat_template不会自动添加它们 - 换句话说，文本将被使用add_special_tokens=False进行标记化。这是为了避免模板和add_special_tokens逻辑之间的潜在冲突。如果您的模型需要特殊标记，请确保将它们添加到模板中！

tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

此模板将每条消息封装在<|im_start|>和<|im_end|>令牌中，并简单地将角色写入字符串，这样可以灵活地使用训练的角色。输出如下所示：

<|im_start|>system
You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
I'm doing great!<|im_end|>

“用户”、“系统”和“助手”角色是聊天的标准角色，我们建议在有意义的情况下使用它们，特别是如果您希望您的模型在 ConversationalPipeline 中运行良好。但是，您不限于这些角色 - 模板非常灵活，任何字符串都可以是一个角色。

我想添加一些聊天模板！我应该如何开始？

如果您有任何聊天模型，您应该设置它们的tokenizer.chat_template属性，并使用 apply_chat_template()进行测试，然后将更新后的 tokenizer 推送到 Hub。即使您不是模型所有者 - 如果您使用的模型具有空白聊天模板，或者仍在使用默认类模板，请打开一个拉取请求到模型存储库，以便正确设置此属性！

一旦设置了属性，就完成了！tokenizer.apply_chat_template现在将正确地为该模型工作，这意味着它也会自动支持像ConversationalPipeline这样的地方！

通过确保模型具有此属性，我们可以确保整个社区都能够使用开源模型的全部功能。格式不匹配已经困扰该领域并悄悄损害了性能太久了 - 是时候结束它们了！

高级：模板编写提示

如果您对 Jinja 不熟悉，我们通常发现编写聊天模板的最简单方法是首先编写一个格式化消息的 Python 脚本，然后将该脚本转换为模板。

记住模板处理程序将接收对话历史作为名为messages的变量。每条消息都是一个带有两个键role和content的字典。您可以在模板中像在 Python 中一样访问messages，这意味着您可以使用{% for message in messages %}循环遍历它，或者例如使用{{ messages[0] }}访问单个消息。

您还可以使用以下提示将您的代码转换为 Jinja：

对于循环

Jinja 中的 for 循环如下所示：

{% for message in messages %}
    {{ message['content'] }}
{% endfor %}

请注意，无论{{表达式块}}中有什么都将打印到输出中。您可以在表达式块内使用+等运算符来组合字符串。

if 语句

Jinja 中的 if 语句如下所示：

{% if message['role'] == 'user' %}
    {{ message['content'] }}
{% endif %}

请注意，Python 使用空格来标记for和if块的开始和结束位置，而 Jinja 要求您使用{% endfor %}和{% endif %}显式结束它们。

特殊变量

在您的模板中，您将可以访问messages列表，但也可以访问几个其他特殊变量。这些包括像bos_token和eos_token这样的特殊标记，以及我们上面讨论过的add_generation_prompt变量。您还可以使用loop变量来访问有关当前循环迭代的信息，例如使用{% if loop.last %}来检查当前消息是否是对话中的最后一条消息。以下是一个将这些想法结合在一起，在对话结束时添加生成提示的示例，如果add_generation_prompt为True：

{% if loop.last and add_generation_prompt %}
    {{ bos_token + 'Assistant:\n' }}
{% endif %}

空格注意事项

尽可能地，我们已经尝试让 Jinja 忽略{{表达式}}之外的空格。但是，请注意，Jinja 是一个通用的模板引擎，它可能会将同一行上块之间的空格视为重要并将其打印到输出中。我们强烈建议在上传模板之前检查您的模板是否在不应该的地方打印额外的空格！
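The whitespace behaviour described above can be checked directly with the `jinja2` package. This sketch assumes a plain `jinja2.Environment` created with the `trim_blocks` and `lstrip_blocks` flags mentioned earlier in this guide. The environment the library builds internally may differ in other respects, and the variable names `show` and `text` are made up for the example.

```python
from jinja2 import Environment

# Same two whitespace flags the guide mentions for chat templates.
env = Environment(trim_blocks=True, lstrip_blocks=True)

# No spaces between blocks: only the expression value is printed.
tight = env.from_string("{% if show %}{{ text }}{% endif %}")

# Spaces between blocks on the same line are kept in the output.
spaced = env.from_string("{% if show %} {{ text }} {% endif %}")

# Newlines after blocks and indentation before a block are dropped.
indented = env.from_string("{% if show %}\n    {% if show %}{{ text }}{% endif %}\n{% endif %}")

tight_out = tight.render(show=True, text="hi")
spaced_out = spaced.render(show=True, text="hi")
indented_out = indented.render(show=True, text="hi")

print(repr(tight_out))
print(repr(spaced_out))
print(repr(indented_out))
```

Printing with `repr` makes stray spaces visible, which is a convenient way to inspect a template before uploading it.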

Trainer

原文链接：huggingface.co/docs/transformers/v4.37.2/en/trainer

Trainer 是在 Transformers 库中实现的 PyTorch 模型的完整训练和评估循环。您只需要传递训练所需的必要部分（模型、分词器、数据集、评估函数、训练超参数等），Trainer 类会处理其余部分。这使得更容易开始训练，而无需手动编写自己的训练循环。但同时，Trainer 非常可定制，并提供大量的训练选项，因此您可以根据自己的训练需求进行定制。

除了 Trainer 类外，Transformers 还提供了一个 Seq2SeqTrainer 类，用于序列到序列任务，如翻译或摘要。还有来自TRL库的SFTTrainer类，它包装了 Trainer 类，并针对使用自回归技术的 Llama-2 和 Mistral 等语言模型进行了优化。SFTTrainer还支持序列打包、LoRA、量化和 DeepSpeed 等功能，以有效地扩展到任何模型大小。

随时查看 API 参考以了解其他 Trainer 类型类的更多信息，以便了解何时使用哪种。一般来说，Trainer 是最通用的选择，适用于广泛的任务。Seq2SeqTrainer 专为序列到序列任务设计，而SFTTrainer专为训练语言模型设计。

在开始之前，请确保已安装Accelerate - 一个用于在分布式环境中启用和运行 PyTorch 训练的库。

pip install accelerate

# upgrade
pip install accelerate --upgrade

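This guide provides an overview of the Trainer class.

Basic usage

Trainer includes all the code you'll find in a basic training loop:

perform a training step to calculate the loss
calculate the gradients with the backward method
update the weights based on the gradients
repeat this process until you've reached a predetermined number of epochs

The Trainer class abstracts all of this code away so you don't have to write a training loop by hand every time, or worry about it if you are just getting started with PyTorch and training. You only need to provide the essential components required for training, such as a model and a dataset, and the Trainer class handles everything else.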
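To make the four steps concrete, here is a dependency-free sketch of such a loop. It is not how Trainer is implemented: a single weight w stands in for the model, the data is made up, and a hand-derived gradient stands in for what the backward method would compute in PyTorch.

```python
# Toy data for y = 2x; a single weight w is the whole "model"
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
learning_rate = 0.05
num_epochs = 50

for epoch in range(num_epochs):          # 4. repeat for a fixed number of epochs
    for x, y in data:
        prediction = w * x
        loss = (prediction - y) ** 2     # 1. training step: compute the loss
        grad = 2 * (prediction - y) * x  # 2. gradient of the loss w.r.t. w (backward's job in PyTorch)
        w -= learning_rate * grad        # 3. update the weight from the gradient

print(round(w, 3))
```

The weight converges toward 2.0, the slope of the toy data. Everything beyond these four steps (batching, evaluation, checkpointing, logging) is what Trainer adds on top.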

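If you want to specify any training options or hyperparameters, you can find them in the TrainingArguments class. For example, let's define where to save the model with output_dir and push the model to the Hub after training with push_to_hub=True.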

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

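Pass training_args to the Trainer along with a model, a dataset, something to preprocess the dataset with (depending on your data type it could be a tokenizer, feature extractor, or image processor), a data collator, and a function to compute the metrics you want to track during training.

Finally, call train() to start training!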

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

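Checkpoints

The Trainer class saves your model checkpoints to the directory specified in the output_dir parameter of TrainingArguments. You'll find the checkpoints saved in checkpoint-000 subfolders, where the number at the end corresponds to the training step. Saving checkpoints is useful for resuming training later.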

# resume from latest checkpoint
trainer.train(resume_from_checkpoint=True)

# resume from specific checkpoint saved in output directory
trainer.train(resume_from_checkpoint="your-model/checkpoint-1000")
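To illustrate the folder naming convention, the sketch below picks the most recent checkpoint-&lt;step&gt; subfolder from an output directory. `latest_checkpoint` is a hypothetical helper written for this example, not the lookup code Trainer itself uses, and the demonstration runs on a temporary directory with empty folders.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> subfolder with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically: as strings, "500" would sort after "1500"
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on a throwaway directory
with tempfile.TemporaryDirectory() as output_dir:
    for step in (500, 1000, 1500):
        os.mkdir(os.path.join(output_dir, f"checkpoint-{step}"))
    found = os.path.basename(latest_checkpoint(output_dir))

print(found)  # checkpoint-1500
```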

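You can save your checkpoints to the Hub (the optimizer state is not saved by default) by setting push_to_hub=True in TrainingArguments to commit and push them. Other options for deciding how your checkpoints are saved are set in the hub_strategy parameter:

hub_strategy="checkpoint" pushes the latest checkpoint to a subfolder named "last-checkpoint" from which you can resume training
hub_strategy="all_checkpoints" pushes all checkpoints to the directory defined in output_dir (you'll see one checkpoint per folder in your model repository)

When you resume training from a checkpoint, the Trainer tries to keep the Python, NumPy, and PyTorch RNG states the same as they were when the checkpoint was saved. But because PyTorch has various non-deterministic default settings, the RNG states aren't guaranteed to be the same. If you want to enable full determinism, take a look at the Controlling sources of randomness guide to learn what you can enable to make your training fully deterministic. Keep in mind that making certain settings deterministic may slow training down.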

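Customize the Trainer

While the Trainer class is designed to be accessible and easy to use, it also offers a lot of customizability for more adventurous users. Many of the Trainer's methods can be subclassed and overridden to support the functionality you want, without having to rewrite the entire training loop to accommodate it. These methods include:

get_train_dataloader() creates a training DataLoader
get_eval_dataloader() creates an evaluation DataLoader
get_test_dataloader() creates a test DataLoader
log() logs information on the various objects that watch training
create_optimizer_and_scheduler() creates an optimizer and learning rate scheduler if they weren't passed in the __init__; these can also be customized separately with create_optimizer() and create_scheduler()
compute_loss() computes the loss on a batch of training inputs
training_step() performs the training step
prediction_step() performs the prediction and test step
evaluate() evaluates the model and returns the evaluation metrics
predict() makes predictions (with metrics if labels are available) on the test set

For example, to customize the compute_loss() method to use a weighted loss instead: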

import torch
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss for 3 labels with different weights
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
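To see what the class weights do numerically, here is a dependency-free sketch of a weighted cross-entropy over plain lists. It follows what is, to our understanding, the default mean reduction of nn.CrossEntropyLoss when a weight is given: each sample's loss is scaled by the weight of its true class, and the sum is divided by the sum of those weights. The logits and labels are made up.

```python
import math


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy, normalized by the summed label weights."""
    total, weight_sum = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Negative log-softmax of the true class, computed stably via the row maximum
        row_max = max(row)
        log_norm = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        nll = log_norm - row[label]
        total += weights[label] * nll
        weight_sum += weights[label]
    return total / weight_sum


logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 2.5]]
labels = [0, 2]
print(weighted_cross_entropy(logits, labels, [1.0, 2.0, 3.0]))
```

With weights [1.0, 2.0, 3.0], an error on class 2 contributes three times as much to the loss as an error on class 0, which is the usual reason to weight a loss for imbalanced classes.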

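Callbacks

Another option for customizing the Trainer is to use callbacks. Callbacks don't change anything in the training loop. They inspect the training loop state and then execute some action (early stopping, logging results, etc.) depending on the state. In other words, a callback can't be used to implement something like a custom loss function; for that you need to subclass Trainer and override the compute_loss() method.

For example, to add an early stopping callback that stops the training loop after 10 steps: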

from transformers import TrainerCallback

class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, num_steps=10):
        self.num_steps = num_steps

    def on_step_end(self, args, state, control, **kwargs):
        # Flag the stop on the shared control object instead of returning a dict
        if state.global_step >= self.num_steps:
            control.should_training_stop = True
        return control

Then pass it to the Trainer's callbacks parameter.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()],
)
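The sketch below imitates, in plain Python, how a training loop can consult callbacks after every step. `Control`, `StopAfter`, and `run_loop` are hypothetical stand-ins written for this example, not Transformers classes. They only illustrate the idea that callbacks observe the state and set flags while the loop itself decides what to do.

```python
class Control:
    """Hypothetical stand-in for the flags object a training loop shares with callbacks."""

    def __init__(self):
        self.should_training_stop = False


class StopAfter:
    """Hypothetical callback: requests a stop once num_steps steps have run."""

    def __init__(self, num_steps):
        self.num_steps = num_steps

    def on_step_end(self, global_step, control):
        if global_step >= self.num_steps:
            control.should_training_stop = True


def run_loop(max_steps, callbacks):
    control = Control()
    global_step = 0
    while global_step < max_steps and not control.should_training_stop:
        global_step += 1            # the training step itself would go here
        for callback in callbacks:  # callbacks only observe and set flags
            callback.on_step_end(global_step, control)
    return global_step


print(run_loop(max_steps=100, callbacks=[StopAfter(10)]))  # 10
```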

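Logging

Check out the logging API reference for more information about the different logging levels.

By default, Trainer is set to logging.INFO, which reports errors, warnings, and other basic information. In a distributed environment, a Trainer replica is set to logging.WARNING, which only reports errors and warnings. You can change the logging level with the log_level and log_level_replica parameters in TrainingArguments.

To configure the log level setting for each node, use the log_on_each_node parameter to determine whether to use the log level on each node or only on the main node.

Trainer sets the log level separately for each node in the Trainer.__init__() method, so you may want to set this sooner if you use other Transformers functionality before creating the Trainer object.

For example, to set your main code and modules to use the same log level according to each node: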

import logging
import sys

import datasets
import transformers
from transformers import Trainer

logger = logging.getLogger(__name__)

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)

trainer = Trainer(...)

Use different combinations of log_level and log_level_replica to configure what gets logged on each of the nodes.

Single node:

my_app.py ... --log_level warning --log_level_replica error

NEFTune

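NEFTune is a technique that can improve performance by adding noise to the embedding vectors during training. To enable it in Trainer, set the neftune_noise_alpha parameter in TrainingArguments to control how much noise is added.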

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(..., neftune_noise_alpha=0.1)
trainer = Trainer(..., args=training_args)

NEFTune is disabled after training to restore the original embedding layer and avoid any unexpected behavior.
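As a rough sketch of what "adding noise to the embedding vectors" means, the function below perturbs a nested list of embeddings with uniform noise. It assumes the scaling commonly described for NEFTune, where the noise magnitude is neftune_noise_alpha divided by the square root of (sequence length × hidden size). `add_neftune_noise` is a hypothetical helper for illustration, not the code Trainer runs.

```python
import math
import random


def add_neftune_noise(embeddings, noise_alpha, rng=random):
    """Add uniform noise to a (seq_len x hidden_dim) nested list of embeddings."""
    seq_len, hidden_dim = len(embeddings), len(embeddings[0])
    # Assumed scaling: larger inputs get proportionally smaller per-element noise
    magnitude = noise_alpha / math.sqrt(seq_len * hidden_dim)
    return [[value + rng.uniform(-magnitude, magnitude) for value in row] for row in embeddings]


embeddings = [[0.0] * 8 for _ in range(4)]
noisy = add_neftune_noise(embeddings, noise_alpha=5.0, rng=random.Random(0))
bound = 5.0 / math.sqrt(4 * 8)
print(all(abs(v) <= bound for row in noisy for v in row))  # True
```

The noise is only meant for training; at inference time the unmodified embeddings are used, which is why the original embedding layer is restored afterwards.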

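Accelerate and Trainer

The Trainer class is powered by Accelerate, a library for easily training PyTorch models in distributed environments with support for integrations such as FullyShardedDataParallel (FSDP) and DeepSpeed.

Learn more about FSDP sharding strategies, CPU offloading, and more with the Trainer in the Fully Sharded Data Parallel guide.

To use Accelerate with Trainer, run the accelerate config command to set up training for your training environment. This command creates a config_file.yaml that will be used when you launch your training script. For example, some configurations you can set up are:

DistributedDataParallel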

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0 #change rank as per the node
main_process_ip: 192.168.20.1
main_process_port: 9898
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

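The accelerate launch command is the recommended way to launch your training script on a distributed system with Accelerate and Trainer, using the parameters specified in config_file.yaml. This file is saved to the Accelerate cache folder and automatically loaded when you run accelerate launch.

For example, to run the run_glue.py training script with the FSDP configuration: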

accelerate launch \
    ./examples/pytorch/text-classification/run_glue.py \
    --model_name_or_path bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 16 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/$TASK_NAME/ \
    --overwrite_output_dir

You can also specify the parameters from the config_file.yaml file directly on the command line:

accelerate launch --num_processes=2 \
    --use_fsdp \
    --mixed_precision=bf16 \
    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP  \
    --fsdp_transformer_layer_cls_to_wrap="BertLayer" \
    --fsdp_sharding_strategy=1 \
    --fsdp_state_dict_type=FULL_STATE_DICT \
    ./examples/pytorch/text-classification/run_glue.py \
    --model_name_or_path bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 16 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/$TASK_NAME/ \
    --overwrite_output_dir

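Check out the Launching your Accelerate scripts tutorial to learn more about accelerate launch and custom configurations.

Run training on Amazon SageMaker

Original: huggingface.co/docs/transformers/v4.37.2/en/sagemaker

The documentation has been moved to hf.co/docs/sagemaker. This page will be removed in transformers 5.0.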

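Export to ONNX

Original: huggingface.co/docs/transformers/v4.37.2/en/serialization

Deploying 🤗 Transformers models in production environments often requires exporting the models into a serialized format that can be loaded and executed on specialized runtimes and hardware.

🤗 Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats such as ONNX and TFLite through its exporters module. 🤗 Optimum also provides a set of performance optimization tools to train and run models on targeted hardware with maximum efficiency.

This guide demonstrates how to export 🤗 Transformers models to ONNX with 🤗 Optimum. For the guide on exporting models to TFLite, please refer to the Export to TFLite page.

Export to ONNX

ONNX (Open Neural Network eXchange) is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an intermediate representation) that represents the flow of data through the neural network.

By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks. For example, a model trained in PyTorch can be exported to ONNX format and then imported in TensorFlow (and vice versa).

Once exported to ONNX format, a model can be:

optimized for inference via techniques such as graph optimization and quantization.
run with ONNX Runtime via ORTModelForXXX classes, which follow the same AutoModel API as the one you are used to in 🤗 Transformers.
run with optimized inference pipelines, which have the same API as the pipeline() function in 🤗 Transformers.

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come ready-made for a number of model architectures and are designed to be easily extendable to other architectures.

For the list of ready-made configurations, please refer to the 🤗 Optimum documentation.

There are two ways to export a 🤗 Transformers model to ONNX; here we show both:

export with 🤗 Optimum via the CLI.
export with 🤗 Optimum with optimum.onnxruntime.

Exporting a 🤗 Transformers model to ONNX with the CLI

To export a 🤗 Transformers model to ONNX, first install an extra dependency: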

pip install optimum[exporters]

To check out all available arguments, refer to the 🤗 Optimum documentation, or view the help on the command line:

optimum-cli export onnx --help

To export a model's checkpoint from the 🤗 Hub, for example distilbert-base-uncased-distilled-squad, run the following command:

optimum-cli export onnx --model distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/

You should see logs indicating progress and showing where the resulting model.onnx is saved, like this:

Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
	-[✓] ONNX model output names match reference model (start_logits, end_logits)
	- Validating ONNX Model output "start_logits":
		-[✓] (2, 16) matches (2, 16)
		-[✓] all values close (atol: 0.0001)
	- Validating ONNX Model output "end_logits":
		-[✓] (2, 16) matches (2, 16)
		-[✓] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx

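The example above illustrates exporting a checkpoint from the 🤗 Hub. When exporting a local model, first make sure that you saved both the model's weights and tokenizer files in the same directory (local_path). When using the CLI, pass the local_path to the model argument instead of the checkpoint name on the 🤗 Hub, and provide the --task argument. You can review the list of supported tasks in the 🤗 Optimum documentation. If the task argument is not provided, it defaults to the model architecture without any task-specific head.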

optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/

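The resulting model.onnx file can then be run on one of the many accelerators that support the ONNX standard. For example, we can load and run the model with ONNX Runtime as follows: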

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
>>> outputs = model(**inputs)
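The export log above lists start_logits and end_logits as the outputs of this question-answering model. As a hedged illustration of how such logits are typically turned into an answer span, the sketch below scores every (start, end) pair on plain lists. `best_span` is a hypothetical helper and the logits are made-up values, not real model output.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) token indices maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy logits for a 6-token input (made-up values)
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.5, 2.5, 0.2, 0.1]
print(best_span(start_logits, end_logits))  # (2, 3)
```

With a real model, the two lists would come from the first row of the model's start and end logits, and the resulting indices would be used to slice the input token ids before decoding them with the tokenizer.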

The process is identical for TensorFlow checkpoints on the Hub. For instance, here is how to export a pure TensorFlow checkpoint from the Keras organization:

optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/

Exporting a 🤗 Transformers model to ONNX with optimum.onnxruntime

As an alternative to the CLI, you can export a 🤗 Transformers model to ONNX programmatically like so:

>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "distilbert_base_uncased_squad"
>>> save_directory = "onnx/"

>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)

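Exporting a model for an unsupported architecture

If you wish to contribute by adding support for a model that cannot currently be exported, first check whether it is supported in optimum.exporters.onnx, and if it is not, contribute to 🤗 Optimum directly.

Exporting a model with transformers.onnx

transformers.onnx is no longer maintained; please export models with 🤗 Optimum as described above. This section will be removed in future versions.

To export a 🤗 Transformers model to ONNX with transformers.onnx, install extra dependencies:

pip install transformers[onnx]

Use the transformers.onnx package as a Python module to export a checkpoint using a ready-made configuration: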

python -m transformers.onnx --model=distilbert-base-uncased onnx/

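This exports an ONNX graph of the checkpoint defined by the --model argument. Pass any checkpoint on the 🤗 Hub or one that is stored locally. The resulting model.onnx file can then be run on one of the many accelerators that support the ONNX standard. For example, load and run the model with ONNX Runtime as follows: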

>>> from transformers import AutoTokenizer
>>> from onnxruntime import InferenceSession

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
>>> session = InferenceSession("onnx/model.onnx")
>>> # ONNX Runtime expects NumPy arrays as input
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))

The required output names (like ["last_hidden_state"]) can be obtained by looking at the ONNX configuration of each model. For example, for DistilBERT we have:

>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig

>>> config = DistilBertConfig()
>>> onnx_config = DistilBertOnnxConfig(config)
>>> print(list(onnx_config.outputs.keys()))
["last_hidden_state"]

The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint like this:

python -m transformers.onnx --model=keras-io/transformers-qa onnx/

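To export a model that is stored locally, save the model's weights and tokenizer files in the same directory (for example local-pt-checkpoint), then export it to ONNX by pointing the --model argument of the transformers.onnx package to the desired directory: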

python -m transformers.onnx --model=local-pt-checkpoint onnx/

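Export to TFLite

Original: huggingface.co/docs/transformers/v4.37.2/en/tflite

TensorFlow Lite is a lightweight framework for deploying machine learning models on resource-constrained devices, such as mobile phones, embedded systems, and Internet of Things (IoT) devices. TFLite is designed to optimize and run models efficiently on these devices with limited computational power, memory, and power consumption. A TensorFlow Lite model is represented in a special efficient portable format identified by the .tflite file extension.

🤗 Optimum offers functionality to export 🤗 Transformers models to TFLite through the exporters.tflite module. For the list of supported model architectures, please refer to the 🤗 Optimum documentation.

To export a model to TFLite, install the required dependencies: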

pip install optimum[exporters-tf]

To check out all available arguments, refer to the 🤗 Optimum documentation, or view the help on the command line:

optimum-cli export tflite --help

To export a model's checkpoint from the 🤗 Hub, for example bert-base-uncased, run the following command:

optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/

You should see logs indicating progress and showing where the resulting model.tflite is saved, like this:

Validating TFLite model...
	-[✓] TFLite model output names match reference model (logits)
	- Validating TFLite Model output "logits":
		-[✓] (1, 128, 30522) matches (1, 128, 30522)
		-[x] values not close enough, max diff: 5.817413330078125e-05 (atol: 1e-05)
The TensorFlow Lite export succeeded with the warning: The maximum absolute difference between the output of the reference model and the TFLite exported model is not within the set tolerance 1e-05:
- logits: max diff = 5.817413330078125e-05.
 The exported model was saved at: bert_tflite

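The example above illustrates exporting a checkpoint from the 🤗 Hub. When exporting a local model, first make sure that you saved both the model's weights and tokenizer files in the same directory (local_path). When using the CLI, pass the local_path to the model argument instead of the checkpoint name on the 🤗 Hub.

Export to TorchScript

Original: huggingface.co/docs/transformers/v4.37.2/en/torchscript

This is the very beginning of our experiments with TorchScript, and we are still exploring its capabilities with variable-input-size models. It is a focus of interest to us, and we will deepen our analysis in upcoming releases, with more code examples, a more flexible implementation, and benchmarks comparing Python-based code with compiled TorchScript.

According to the TorchScript documentation:

TorchScript is a way to create serializable and optimizable models from PyTorch code.

There are two PyTorch modules, JIT and TRACE, that allow developers to export their models to be reused in other programs, such as efficiency-oriented C++ programs.

We provide an interface that allows you to export 🤗 Transformers models to TorchScript so they can be reused in a different environment than PyTorch-based Python programs. Here, we explain how to export and use our models with TorchScript.

Exporting a model requires two things:

model instantiation with the torchscript flag
a forward pass with dummy inputs

These necessities imply several things developers should be careful about, as detailed below.

TorchScript flag and tied weights

The torchscript flag is necessary because most of the 🤗 Transformers language models have tied weights between their Embedding layer and their Decoding layer. TorchScript does not allow you to export models that have tied weights, so it is necessary to untie and clone the weights beforehand.

Models instantiated with the torchscript flag have their Embedding layer and Decoding layer separated, which means that they should not be trained down the line. Training would desynchronize the two layers, leading to unexpected results.

This is not the case for models that do not have a language model head, as those do not have tied weights. These models can be safely exported without the torchscript flag.

Dummy inputs and standard lengths

The dummy inputs are used for a model's forward pass. While the inputs' values are propagated through the layers, PyTorch keeps track of the different operations executed on each tensor. These recorded operations are then used to create the trace of the model.

The trace is created relative to the inputs' dimensions. It is therefore constrained by the dimensions of the dummy input, and will not work for any other sequence length or batch size. When trying with a different size, the following error is raised:

`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2`

We recommend tracing the model with a dummy input size at least as large as the largest input that will be fed to the model during inference. Padding can help fill the missing values. However, since the model is traced with a larger input size, the dimensions of the matrix will also be large, resulting in more calculations.

Be careful of the total number of operations done on each input, and follow the performance closely when exporting models with varying sequence lengths.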
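As a minimal sketch of the padding step, the helper below pads a list of token ids up to the length the model was traced with and builds the matching attention mask. `pad_to_traced_length` is a hypothetical helper, and pad_token_id=0 is an assumption that matches the [PAD] id of bert-base-uncased; check your own tokenizer. The mask is only useful if the model was traced with an attention mask among its inputs.

```python
def pad_to_traced_length(input_ids, traced_length, pad_token_id=0):
    """Pad token ids to the traced length and build the matching attention mask."""
    if len(input_ids) > traced_length:
        raise ValueError(
            f"input has {len(input_ids)} tokens but the trace was made for {traced_length}"
        )
    padding = traced_length - len(input_ids)
    padded_ids = input_ids + [pad_token_id] * padding
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [1] * len(input_ids) + [0] * padding
    return padded_ids, attention_mask


ids, mask = pad_to_traced_length([101, 2040, 2001, 102], traced_length=7)
print(ids)   # [101, 2040, 2001, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0]
```

Inputs longer than the traced length cannot be padded into shape, so the helper raises instead of silently truncating.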

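Using TorchScript in Python

This section demonstrates how to save and load models, as well as how to use the trace for inference.

Saving a model

To export a BertModel with TorchScript, instantiate BertModel from the BertConfig class and then save it to disk under the filename traced_bert.pt: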

from transformers import BertModel, BertTokenizer, BertConfig
import torch

enc = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
dummy_input = [tokens_tensor, segments_tensors]

# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    torchscript=True,
)

# Instantiating the model
model = BertModel(config)

# The model needs to be in evaluation mode
model.eval()

# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

# Creating the trace
traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors])
torch.jit.save(traced_model, "traced_bert.pt")

Loading a model

Now you can load the previously saved BertModel, traced_bert.pt, from disk and use it on the previously initialized dummy_input:

loaded_model = torch.jit.load("traced_bert.pt")
loaded_model.eval()

all_encoder_layers, pooled_output = loaded_model(*dummy_input)

Using a traced model for inference

Use the traced model for inference by calling it through its __call__ dunder method:

traced_model(tokens_tensor, segments_tensors)

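Deploy Hugging Face TorchScript models to AWS with the Neuron SDK

AWS introduced the Amazon EC2 Inf1 instance family for low-cost, high-performance machine learning inference in the cloud. The Inf1 instances are powered by the AWS Inferentia chip, a custom-built hardware accelerator specializing in deep learning inference workloads. AWS Neuron is the SDK for Inferentia that supports tracing and optimizing transformers models for deployment on Inf1. The Neuron SDK provides:

An easy-to-use API with a one-line code change to trace and optimize a TorchScript model for inference in the cloud.
Out-of-the-box performance optimizations for improved cost-performance.
Support for Hugging Face transformers models built with either PyTorch or TensorFlow.

Implications

Transformers models based on the BERT (Bidirectional Encoder Representations from Transformers) architecture, or its variants such as distilBERT and roBERTa, run best on Inf1 for non-generative tasks such as extractive question answering, sequence classification, and token classification. However, text generation tasks can still be adapted to run on Inf1 according to this AWS Neuron MarianMT tutorial. More information about models that can be converted out of the box on Inferentia can be found in the Model Architecture Fit section of the Neuron documentation.

Dependencies

Using AWS Neuron to convert models requires a Neuron SDK environment, which comes preconfigured on the AWS Deep Learning AMI.

Converting a model for AWS Neuron

Convert a model for AWS NEURON using the same code from Using TorchScript in Python to trace a BertModel. Import the torch.neuron framework extension to access the components of the Neuron SDK through a Python API: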

from transformers import BertModel, BertTokenizer, BertConfig
import torch
import torch.neuron

You only need to modify the following line:

- torch.jit.trace(model, [tokens_tensor, segments_tensors])
+ torch.neuron.trace(model, [tokens_tensor, segments_tensors])

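This enables the Neuron SDK to trace the model and optimize it for Inf1 instances.

To learn more about AWS Neuron SDK features, tools, example tutorials, and latest updates, please see the AWS NeuronSDK documentation.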

posted @ 2024-06-22 14:14 绝不原创的飞龙
