Automatic Verbalizer: Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification
Reading Notes
Recent few-shot classification approaches convert the text input into some form of cloze question, process it with a pretrained language model (PLM), and map the predicted words to class labels. Manually defining this word-to-label mapping (the verbalizer) requires domain expertise and an understanding of the language model's abilities.

To alleviate this problem, the authors design a method that automatically finds such a mapping given only a small amount of training data. Their approach has two main steps:
● First, find candidate label words over the pretrained language model's vocabulary.
● Then, from those candidates, pick each class's final label words.
The goal is to automatically find, for every class label y, the best set of label words v(y), for example the set {sports, football, team} for the class sports. How do we find this best verbalizer v? Over the training data T, we want the v that maximizes the probability of the correct label for every sample, i.e. the product over all (x, y) in T of the conditional probability of y given x and v. Written as a formula:

v^{*} = \arg\max_{v} \prod_{(x, y) \in T} q_{P, v}(y \mid x)

This is maximum likelihood estimation: use the observed samples to infer the parameter value that is most likely to have produced them. Suppose there are k classes and the pretrained language model's vocabulary contains |V| words; then even with a single label word per class there are |V|^k possible verbalizers, so iterating over them one by one is infeasible. How do the authors solve this? They convert the k-class classification problem into k binary (one-vs-rest) problems: for each label y, find a label-word set v(y) that lets the model M distinguish samples with label y from samples with any other label. This may sound abstract, so let's look at a concrete example:
probs = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # label 1
    [0.3, 0.4, 0.2, 0.4, 0.1, 0.1],  # label 0
    [0.1, 0.1, 0.2, 0.3, 0.4, 0.5],  # label 1
]
Suppose the matrix above gives the probability that the pretrained language model assigns to each vocabulary word for each input sample. Each row of probs is one input sample (3 samples in total), each column is one vocabulary word (a toy vocabulary of 6 words), and the trailing 1, 0, 1 are the class labels: 1 is the positive class, 0 the negative class. In other words, probs describes the distribution of the vocabulary words over the input samples. Now, how do we find label words for the positive class? First compute word 1's score on the positive samples, i.e. 0.1 + 0.1 = 0.2; then word 2's score, 0.2 + 0.1 = 0.3; then word 3's score, 0.3 + 0.2 = 0.5; and so on. Say we want 3 words as the positive class's label words: after computing the scores of all 6 words, we sort them and take the 3 highest-scoring words as the positive class's label words; the negative class is handled the same way. (The actual code sums log-probabilities rather than raw probabilities, but the idea is the same.) The implementation looks like this:
def _get_candidates(self,
                    num_candidates: int,
                    probs: torch.Tensor,
                    labels: torch.Tensor,
                    ) -> List[torch.Tensor]:
    if num_candidates <= 0:  # no pre-filtering: every vocabulary id is a candidate
        return [torch.arange(self.vocab_size) for label_id in range(self.num_classes)]
    log_probs = torch.log(probs + 1e-15)
    candidate_ids = []
    for label_id in range(self.num_classes):
        # 1 for samples of this class, 0 otherwise, as a column vector
        label_mask = (labels == label_id).to(torch.float).unsqueeze(-1)
        # per-word score: sum of log-probabilities over this class's samples
        score = torch.sum(log_probs * label_mask, dim=0)
        candidate_id = torch.argsort(score, descending=True)[:num_candidates]
        candidate_ids.append(candidate_id)
    return candidate_ids
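To connect this code with the toy example above, here is a minimal self-contained sketch (the probs matrix, labels, and num_candidates value are the toy numbers from the example, not anything from the paper) that reproduces the same masking-and-scoring logic outside the class:

import torch

# toy numbers from the example above: 3 samples, 6-word vocabulary
probs = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                      [0.3, 0.4, 0.2, 0.4, 0.1, 0.1],
                      [0.1, 0.1, 0.2, 0.3, 0.4, 0.5]])
labels = torch.tensor([1, 0, 1])
num_candidates = 3

log_probs = torch.log(probs + 1e-15)
for label_id in (0, 1):
    # column vector: 1.0 for samples of this class, 0.0 otherwise
    label_mask = (labels == label_id).to(torch.float).unsqueeze(-1)
    # per-word score: sum of log-probabilities over this class's samples
    score = torch.sum(log_probs * label_mask, dim=0)
    top = torch.argsort(score, descending=True)[:num_candidates]
    print(f"class {label_id}: candidate word ids {top.tolist()}")
# for the positive class, summing log-probabilities happens to give the same
# ranking as the raw-probability sums computed by hand in the text.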
With this, we have found num_candidates candidate label words for each class from the PLM's vocabulary. Finding candidates this way really does make full use of the semantic and syntactic knowledge the PLM acquired during pretraining. But think about it for a moment: is there anything questionable here? Could there be words that get high probability on positive samples while also getting high probability on negative samples? Such words surely exist, so these num_candidates candidates are not all words we want, and we need to eliminate some of them in a second round. From here on I will walk through Automatic Verbalizer with the code as the main thread, using a 5-shot setting as the example, i.e. 5 training samples per class; on the agnews dataset (4 classes) that gives 20 input samples in total. When we call prompt_model(inputs), the verbalizer first accumulates a tensor of shape [20, 50265]: 20 is the total number of input samples, and 50265 is the size of the vocabulary the pretrained language model was trained with.
A somewhat tricky but very common operation in the code above is building label_mask (the standalone sketch above shows it in isolation). Getting the list of candidate word IDs is only the first step (candidate_ids is a list of shape [num_classes, num_candidates]); next we have to select the best label words from among the candidates, which is implemented by _get_top_words below:
def _get_top_words(self,
                   probs: torch.Tensor,
                   candidates: List[torch.Tensor],
                   balance: bool = True,
                   words_per_label: int = 10,
                   score_fct: Optional[str] = 'llr'):
    label_words_ids = []
    for label_id in range(self.num_classes):
        label_mask = (self.labels_buffer == label_id).to(torch.float)
        print(label_mask)
        print(label_mask.shape)          # [20], using the 5-shot example
        probs_per_label = probs[:, candidates[label_id]]
        print("probs_per_label:", probs_per_label)
        print(probs_per_label.shape)     # [20, 1000]
        if score_fct == 'llr':
            s = self._log_likelihood_ratio(probs_per_label, label_mask, balance)
        elif score_fct == 'ce':
            s = self._cross_entropy(probs_per_label, label_mask, balance)
        else:
            raise ValueError(f"Score function '{score_fct}' not implemented")
        sorted_ids = torch.argsort(s, descending=True)[:words_per_label]
        selected_ids = candidates[label_id][sorted_ids]
        label_words_ids.append(selected_ids)
    label_words_ids = torch.vstack(label_words_ids)
    return label_words_ids
The candidates argument above is exactly the candidate_ids returned by _get_candidates(). One key line is probs_per_label = probs[:, candidates[label_id]], which gathers, for every sample, the probabilities of this class's candidate label words. What happens next is analogous to how we filtered candidates out of the PLM's vocabulary: we score each candidate word for each class and keep the final label words with the highest scores.
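For readers less familiar with this indexing pattern, here is a tiny standalone sketch (the shapes and random numbers are made up purely for illustration) of how probs[:, candidate_ids] selects one column per candidate word:

import torch

probs = torch.rand(20, 50265)                     # [num_samples, vocab_size]
candidate_ids = torch.randint(0, 50265, (1000,))  # 1000 candidate word ids for one class

probs_per_label = probs[:, candidate_ids]         # gather those 1000 columns for every sample
print(probs_per_label.shape)                      # torch.Size([20, 1000])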
Now we reach the key part of the paper: the two scoring functions, cross entropy and log likelihood ratio. Let's look at the cross-entropy score first; for a candidate label word t and a class y it is (written the way the code computes it):

s_{CE}(t, y) = \sum_{(x, y') \in T,\, y' = y} \log q_{P,t}(1 \mid x) \;+\; \alpha \sum_{(x, y') \in T,\, y' \neq y} \log\bigl(1 - q_{P,t}(1 \mid x)\bigr)

Don't let it scare you; let's unpack it step by step. There are only two classes here: positive (samples whose label is y) and negative (samples with any other label). α is the balancing factor: it is 1 for positive samples, and for negative samples it equals the number of positive samples divided by the number of negative samples, so that both sides contribute equally when the one-vs-rest split is unbalanced; T is the whole training set. The term q_{P,t}(1 | x) should be read as a whole: given the pattern P and the candidate label word t, it is the probability that sample x belongs to the positive class, obtained by a softmax over the language model's logits at the mask position (this corresponds to Equation 3 in the paper and to the F.softmax in register_buffer), and we take its logarithm. The first sum is the score on the positive samples; the second sum is the score on the negative samples, which is exactly what neg_score = torch.sum(torch.log(1 - probs + 1e-15) * scale_factor, dim=0) computes; the two are then added together.
def _cross_entropy(self, probs, label_mask, balance):
    if balance:
        # down-weight the (more numerous) negative samples by n_pos / n_neg
        scale_factor = torch.sum(label_mask) / torch.sum(1 - label_mask) \
                       * (1 - label_mask).unsqueeze(-1)
    else:
        scale_factor = (1 - label_mask).unsqueeze(-1)
    label_mask = label_mask.unsqueeze(-1)
    pos_score = torch.sum(torch.log(probs + 1e-15) * label_mask, dim=0)
    neg_score = torch.sum(torch.log(1 - probs + 1e-15) * scale_factor, dim=0)
    return pos_score + neg_score
Since there are 1000 candidate words (num_candidates defaults to 1000), each class ends up with 1000 scores, and _get_top_words keeps the words_per_label highest-scoring words of each class as its final label words, via the argsort over s shown above.
Is that all? Not quite: the search is designed to be repeatable, so a maximum number of searches is specified, as the code below shows (a sketch of how this epoch-level call is typically driven follows the listing). In practice, however, when I set num_searches to a value greater than 1, it still did not iterate; presumably the training script calls optimize_to_initialize() only once, so search_id never advances past the first search.
def optimize_to_initialize(self):
    r"""This is an epoch-level optimize. If used in batch-level like an ordinary
    gradient descend optimizer, the result may not be very satisfying since the accumulated
    examples (i.e., the probs_buffer and the labels_buffer) are not enough if the batchsize
    is small.
    """
    if self.search_id < self.num_searches:
        self.label_words_ids = self._find_verbalizer(words_per_label=self.label_word_num_per_class,
                                                     num_candidates=self.num_candidates,
                                                     score_fct=self.score_fct,
                                                     balance=self.balance)
        self.probs_buffer, self.labels_buffer = None, None
        self.search_id += 1
        if self.search_id == self.num_searches:  # finish optimization
            self.accumulate = False
    else:
        print("Verbalizer's max num_searches reached, use the previous label words.")
    self._show_verbalizer()
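To make the intended flow concrete, here is a minimal sketch of how this epoch-level call is usually driven. It assumes an already-built OpenPrompt PromptForClassification model named prompt_model whose verbalizer is this AutomaticVerbalizer, plus a train_dataloader and num_epochs; these names are placeholders, and your own training script may wire things differently:

for epoch in range(num_epochs):
    for batch in train_dataloader:
        # while accumulate is True, process_logits softmaxes the vocabulary logits
        # and appends them (together with the labels) to the buffers; before the
        # first search it returns random class logits, afterwards it projects onto
        # the label words it has found.
        logits = prompt_model(batch)
        # ... compute the loss and update the model as in any OpenPrompt example ...
    # epoch-level step: run the label-word search on the accumulated buffers.
    # After num_searches such calls, accumulate is set to False and the verbalizer
    # keeps using the label words it has found.
    prompt_model.verbalizer.optimize_to_initialize()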
That wraps up the basic idea of the paper. Oh, one more thing: the paper also proposes a log-likelihood-ratio score, motivated by a drawback of the cross-entropy score. In the one-vs-rest setting the negative samples outnumber the positive ones (say 80 negatives versus 20 positives), so a score that only rewards log q on positives and log(1 − q) on negatives can be dominated by the majority side, and the selection drifts toward words that merely look plausible overall rather than words that actually separate the two classes. The log-likelihood-ratio score therefore scores the contrast between the two sides:

s_{LLR}(t, y) = \sum_{(x, y') \in T,\, y' = y} \bigl[\log q_{P,t}(1 \mid x) - \log(1 - q_{P,t}(1 \mid x))\bigr] \;+\; \alpha \sum_{(x, y') \in T,\, y' \neq y} \bigl[\log(1 - q_{P,t}(1 \mid x)) - \log q_{P,t}(1 \mid x)\bigr]

so a word only scores well when its probability is high on one side and low on the other. The implementation is as follows:
def _log_likelihood_ratio(self, probs, label_mask, balance):
    if balance:
        # down-weight the (more numerous) negative samples by n_pos / n_neg
        scale_factor = torch.sum(label_mask) / torch.sum(1 - label_mask) \
                       * (1 - label_mask).unsqueeze(-1)
    else:
        scale_factor = (1 - label_mask).unsqueeze(-1)
    label_mask = label_mask.unsqueeze(-1)
    # positive samples: reward log q and penalize log(1 - q)
    pos_score = torch.sum(torch.log(probs + 1e-15) * label_mask, dim=0) \
                - torch.sum(torch.log(1 - probs + 1e-15) * label_mask, dim=0)
    # negative samples: reward log(1 - q) and penalize log q
    neg_score = torch.sum(torch.log(1 - probs + 1e-15) * scale_factor, dim=0) \
                - torch.sum(torch.log(probs + 1e-15) * scale_factor, dim=0)
    return pos_score + neg_score
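To see the difference between the two scores on concrete numbers, here is a small self-contained sketch; the two candidate "words" and their probabilities are invented purely for illustration, and the scores are computed exactly as _cross_entropy and _log_likelihood_ratio would with balance=False:

import torch

# one positive sample followed by three negative samples; two candidate words:
# column 0 is rare but discriminative, column 1 is fairly likely on both sides
probs = torch.tensor([[0.010, 0.60],   # positive sample
                      [0.001, 0.50],   # negative sample
                      [0.001, 0.50],   # negative sample
                      [0.001, 0.50]])  # negative sample
label_mask = torch.tensor([1., 0., 0., 0.]).unsqueeze(-1)
neg_mask = 1 - label_mask

# cross-entropy style score (balance=False)
ce = (torch.log(probs + 1e-15) * label_mask).sum(0) \
     + (torch.log(1 - probs + 1e-15) * neg_mask).sum(0)
# log-likelihood-ratio style score (balance=False)
llr = ((torch.log(probs + 1e-15) - torch.log(1 - probs + 1e-15)) * label_mask).sum(0) \
      + ((torch.log(1 - probs + 1e-15) - torch.log(probs + 1e-15)) * neg_mask).sum(0)

print("CE :", ce)    # higher for column 1, the word that is likely everywhere
print("LLR:", llr)   # higher for column 0, the word that separates the classes

Here the cross-entropy score prefers the word with high raw probability even though it barely separates the two classes, while the log likelihood ratio prefers the word whose probability differs sharply between positive and negative samples; this matches the remark in the docstring below that llr works significantly better than ce.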
To sum up, Automatic Verbalizer builds on the rich semantic knowledge of the pretrained language model and uses statistical scoring (the two scores above) to search the PLM's vocabulary space for suitable label words for each class. Drawbacks? The first I can think of is that the label words it finds are limited to the vocabulary the PLM was trained with; the second is that, judging from the source code, no iterative search is actually performed; would an iterative search be more accurate?
Implementation code: the AutomaticVerbalizer class from the openprompt framework:
from transformers.tokenization_utils import PreTrainedTokenizer
from openprompt.data_utils import InputFeatures
from openprompt import Verbalizer
from typing import List, Optional, Dict
import torch
import torch.nn as nn
import torch.nn.functional as F
from openprompt.utils.logging import logger
class AutomaticVerbalizer(Verbalizer):
    r"""
    This implementation is slightly different from the original code in that

    1). we allow re-selecting the verbalizer after a fixed training steps.
    The original implementation only performs one step selection after getting
    the initial logits on the training data. To adopt their implementation,
    please only do ``optimize()`` after the first pass of training data.

    2). We strictly follow the probability calculation in Equation (3) in the
    paper, which takes softmax over the logits.

    3). We do not implement the ``combine_patterns`` if-branch, since it's
    not a pure verbalizer type and doesn't yield much improvement. However,
    it can be achieved by using EnsembleTrainer to pass text wrapped by
    multiple templates together with this verbalizer.

    We use a probs_buffer to store the probability :math:`q_{P,t}(1|\mathbf{x})` to be used in later verbalizer selection,
    and a labels_buffer to store the label :math:`y` to be used in later verbalizer selection.

    Args:
        num_candidates (:obj:`int`, optional): the number of candidates for further selection based on Section 4.1
        label_word_num_per_class (:obj:`int`, optional): set to be greater than 1 to support Multi-Verbalizers in Section 4.2
        num_searches (:obj:`int`, optional): Maximum number of label_words searches. After reaching this number, the verbalizer will use the same label_words as the previous iterations.
        search_id (:obj:`int`, optional): the id of the current search, used to determine when to stop label words searching.
        score_fct (:obj:`str`, optional): the scoring function of label words selection. ``llr`` means log likelihood ratio, corresponding to Equation (7); ``ce`` means cross entropy, corresponding to Equation (6). As the paper points out, ``llr`` is significantly better than ``ce``; we only keep the latter to match the original code.
        balance (:obj:`bool`, optional): whether to perform normalization of an unbalanced training dataset, as Equation (5).
    """

    def __init__(self,
                 tokenizer: PreTrainedTokenizer = None,
                 num_candidates: Optional[int] = 1000,
                 label_word_num_per_class: Optional[int] = 1,
                 num_searches: Optional[int] = 1,
                 score_fct: Optional[str] = 'llr',
                 balance: Optional[bool] = True,
                 num_classes: Optional[int] = None,
                 classes: Optional[List[str]] = None,
                 init_using_split: Optional[str] = "train",
                 **kwargs):
        super().__init__(num_classes=num_classes, tokenizer=tokenizer, classes=classes)
        self.num_candidates = num_candidates
        self.label_word_num_per_class = label_word_num_per_class
        self.probs_buffer, self.labels_buffer = None, None
        assert num_searches > 0, "You require the verbalizer to perform {} searches. Invalid.".format(num_searches)
        self.num_searches = num_searches
        self.search_id = 0
        self.accumulate_step = 0  # currently not used, to support non-epoch-level optimize.
        self.accumulate = True  # A flag to indicate whether to
        # accumulate examples for optimization.
        # set to False after finishing optimization.
        self.score_fct = score_fct
        self.balance = balance
        self.init_using_split = init_using_split

    def register_buffer(self, logits, labels):
        r'''
        Store the logits output by the model and the corresponding labels into the buffers.

        Args:
            logits (:obj:`torch.Tensor`): The logits output by the model.
            labels (:obj:`List`): The corresponding labels.
        '''
        logits = F.softmax(logits.detach(), dim=-1)
        labels = labels.detach()
        if self.probs_buffer is None:
            self.probs_buffer = logits
            self.labels_buffer = labels
        else:
            self.probs_buffer = torch.vstack([self.probs_buffer, logits])
            self.labels_buffer = torch.hstack([self.labels_buffer, labels])

    def process_logits(self, logits: torch.Tensor, **kwargs):
        if self.accumulate:  # inherit from nn.Module, only store buffer in training mode.
            self.accumulate_step += 1
            self.register_buffer(logits, kwargs['batch']['label'])

        if hasattr(self, "label_words_ids"):  # TODO the content in this "if" is same as super()
            # project
            label_words_logits = self.project(logits, **kwargs)  # Output: (batch_size, num_classes) or (batch_size, num_classes, num_label_words_per_label)

            # normalize
            label_words_probs = self.normalize(label_words_logits)

            # calibrate
            if hasattr(self, "_calibrate_logits") and self._calibrate_logits is not None:
                label_words_probs = self.calibrate(label_words_probs=label_words_probs)

            # convert to logits
            label_words_logits = torch.log(label_words_probs + 1e-15)

            # aggregate
            if label_words_logits.dim() > 2:
                label_logits = self.aggregate(label_words_logits)
            else:
                label_logits = label_words_logits
            return label_logits
        else:
            return torch.randn((logits.size(0), self.num_classes), requires_grad=True).to(logits.device)

    def project(self,
                logits: torch.Tensor,
                **kwargs,  # TODO
                ) -> torch.Tensor:
        r"""When this verbalizer hasn't performed optimize(), it has no
        ``label_words_ids``, thus will give random predictions, and should
        have no connection to the model to give (misleading) grads.

        Args:
            logits (:obj:`torch.Tensor`): The original logits over the vocabulary.

        Returns:
            :obj:`torch.Tensor`: The projected logits of label words.
        """
        label_words_logits = logits[:, self.label_words_ids]
        return label_words_logits

    def optimize(self):
        pass

    def optimize_to_initialize(self):
        r"""This is an epoch-level optimize. If used in batch-level like an ordinary
        gradient descend optimizer, the result may not be very satisfying since the accumulated
        examples (i.e., the probs_buffer and the labels_buffer) are not enough if the batchsize
        is small.
        """
        if self.search_id < self.num_searches:
            self.label_words_ids = self._find_verbalizer(words_per_label=self.label_word_num_per_class,
                                                         num_candidates=self.num_candidates,
                                                         score_fct=self.score_fct,
                                                         balance=self.balance)
            self.probs_buffer, self.labels_buffer = None, None
            self.search_id += 1
            if self.search_id == self.num_searches:  # finish optimization
                self.accumulate = False
        else:
            print("Verbalizer's max num_searches reached, use the previous label words.")
        self._show_verbalizer()

    def _show_verbalizer(self):
        tokens = [self.tokenizer.convert_ids_to_tokens(i) for i in self.label_words_ids]
        # logger.info("Verbalizer is {}".format(tokens))
        print("Verbalizer is {}".format(tokens))

    def _find_verbalizer(self, words_per_label: int = 1, num_candidates: int = 1000, balance: bool = True,
                         score_fct: str = 'llr'):
        print("Find verbalizer...")
        probs = self.probs_buffer
        print("probs:", probs)
        print(probs.shape)
        labels = self.labels_buffer
        print("labels:", labels)
        print(labels.shape)
        candidates = self._get_candidates(num_candidates=num_candidates, probs=probs, labels=labels)
        label_words = self._get_top_words(probs=probs, candidates=candidates, balance=balance, words_per_label=words_per_label,
                                          score_fct=score_fct)
        return label_words

    def _get_candidates(self,
                        num_candidates: int,
                        probs: torch.Tensor,
                        labels: torch.Tensor,
                        ) -> List[torch.Tensor]:
        if num_candidates <= 0:
            return [torch.arange(self.vocab_size) for label_id in range(self.num_classes)]

        log_probs = torch.log(probs + 1e-15)
        candidate_ids = []
        for label_id in range(self.num_classes):
            label_mask = (labels == label_id).to(torch.float).unsqueeze(-1)
            score = torch.sum(log_probs * label_mask, dim=0)
            candidate_id = torch.argsort(score, descending=True)[:num_candidates]
            candidate_ids.append(candidate_id)
        return candidate_ids

    def _get_top_words(self,
                       probs: torch.Tensor,
                       candidates: List[torch.Tensor],
                       balance: bool = True,
                       words_per_label: int = 10,
                       score_fct: Optional[str] = 'llr'):
        label_words_ids = []
        for label_id in range(self.num_classes):
            label_mask = (self.labels_buffer == label_id).to(torch.float)
            probs_per_label = probs[:, candidates[label_id]]
            if score_fct == 'llr':
                s = self._log_likelihood_ratio(probs_per_label, label_mask, balance)
            elif score_fct == 'ce':
                s = self._cross_entropy(probs_per_label, label_mask, balance)
            else:
                raise ValueError(f"Score function '{score_fct}' not implemented")
            sorted_ids = torch.argsort(s, descending=True)[:words_per_label]
            selected_ids = candidates[label_id][sorted_ids]
            label_words_ids.append(selected_ids)
        label_words_ids = torch.vstack(label_words_ids)
        return label_words_ids

    def _log_likelihood_ratio(self, probs, label_mask, balance):
        if balance:
            scale_factor = torch.sum(label_mask) / torch.sum(1 - label_mask) \
                           * (1 - label_mask).unsqueeze(-1)
        else:
            scale_factor = (1 - label_mask).unsqueeze(-1)

        label_mask = label_mask.unsqueeze(-1)
        pos_score = torch.sum(torch.log(probs + 1e-15) * label_mask, dim=0) - torch.sum(torch.log(1 - probs + 1e-15) * label_mask, dim=0)
        neg_score = torch.sum(torch.log(1 - probs + 1e-15) * scale_factor, dim=0) - torch.sum(torch.log(probs + 1e-15) * scale_factor, dim=0)
        return pos_score + neg_score

    def _cross_entropy(self, probs, label_mask, balance):
        if balance:
            scale_factor = torch.sum(label_mask) / torch.sum(1 - label_mask) \
                           * (1 - label_mask).unsqueeze(-1)
        else:
            scale_factor = (1 - label_mask).unsqueeze(-1)

        label_mask = label_mask.unsqueeze(-1)
        pos_score = torch.sum(torch.log(probs + 1e-15) * label_mask, dim=0)
        neg_score = torch.sum(torch.log(1 - probs + 1e-15) * scale_factor, dim=0)
        return pos_score + neg_score

    def from_file(self,
                  path: str,
                  choice: Optional[int] = 0):
        raise NotImplementedError("This verbalizer is learned and can't be set from file.")
A few notes on the AutomaticVerbalizer class:
● __init__ defines the parameters and can be adjusted to later needs.
● register_buffer buffers the probs and labels; it needs no modification, and neither does process_logits.
● project gathers the logits of each class's label words and needs no modification either; the same goes for _show_verbalizer.
● optimize_to_initialize is what triggers the search (it calls _find_verbalizer). Its docstring makes an important point: it is an epoch-level optimization, and if you use it at batch level like an ordinary gradient-descent optimizer, the result may not be very satisfying, because with a small batch size the accumulated examples (i.e. probs_buffer and labels_buffer) are not enough.
● _find_verbalizer simply calls _get_candidates and _get_top_words.
● _get_candidates searches the vocabulary the PLM was trained with for each class's candidate label words and returns their ID lists.
● _get_top_words selects each class's final label words from the candidate lists. How? It uses one of the scoring functions to compute each candidate's score on the k-shot samples, sorts the scores, and takes the top words_per_label candidates as the final label words.
● _log_likelihood_ratio and _cross_entropy are those scoring functions.
Author: 爱编码的懒虫
Link: https://www.cnblogs.com/jokewl/p/18630042
Copyright: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 2.5 China Mainland License.