
Paper Reading: The Discriminative Power of Different Downstream Models over Pretrained Encodings (Designing and Interpreting Probes with Control Tasks)

Notations[1]

  • Token → Representation: the encoder maps each token \(x_i\) to a vector representation, and it is these representations (not the raw tokens) that the probe consumes.

But does this mean that the representations encode linguistic structure or just that the probe has learned the linguistic task?

We also find that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective.

Even so, as long as a representation is a lossless encoding, a sufficiently expressive probe with enough training data can learn any task on top of it.

Control tasks

A token-level task maps each position of the input sequence to a label: \(f(\mathbf{x}_{1:T}) = \mathbf{y}_{1:T},\; y_i \in \mathcal{Y}\)

A control behavior is simply a manually specified mapping: each word type is assigned a fixed, arbitrary label, independent of context (see the sketch after the list below).

  • \(f_{\text{POS tagging}} : \mathbf{x} \to \mathcal{Y}\)
  • \(f_{\text{Control Mapping}}: \mathbf{x} \to \mathcal{Y}\)
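To make this concrete, here is a minimal sketch of such a control mapping in Python. The tag set here is a hypothetical simplification, and labels are sampled uniformly; in the paper, each word type's control label is drawn from the empirical tag distribution.

```python
import random

# Hypothetical simplified tag set; the paper's POS experiments use the full
# Penn Treebank tag set and sample from the empirical tag distribution.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON"]

def make_control_mapping(vocab, tags=TAGS, seed=0):
    """Assign each word *type* one fixed, randomly chosen label.

    The mapping depends only on word identity, never on context, so a
    probe can solve it only by memorizing word types.
    """
    rng = random.Random(seed)
    return {word: rng.choice(tags) for word in vocab}

# f_control applied to a token sequence: look up each type's fixed label.
x = ["祝", "你", "今天", "愉快"]  # "Have a nice day"
C = make_control_mapping(set(x))
control_labels = [C[w] for w in x]
print(control_labels)
```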

The essence of the control experiment

  • “祝你今天愉快” (“Have a nice day”) is the input sequence \(x\)
  • Its true part-of-speech tags are \(y\)
  • The control task's target output is \(C(x)\)
  • Note that what is actually fed into the probe is \(\text{Encoder}(x)\), not the raw tokens (see the sketch below)
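A hedged sketch of the whole pipeline: the same probe family is trained on frozen encoder outputs, once against the true labels \(y\) and once against the control labels \(C(x)\). The random embedding table is an assumption made so the snippet is self-contained; the paper uses a real contextual encoder (ELMo).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for a pretrained encoder: one fixed random vector per word type.
# In the real experiments this would be a contextual encoder such as ELMo.
EMB = {w: rng.normal(size=64) for w in ["祝", "你", "今天", "愉快"]}

def encode(tokens):
    """Encoder(x): map tokens to their frozen representations."""
    return np.stack([EMB[t] for t in tokens])

x = ["祝", "你", "今天", "愉快"]
y_linguistic = ["VERB", "PRON", "NOUN", "ADJ"]  # true POS tags y (illustrative)
y_control = ["ADJ", "ADJ", "VERB", "PRON"]      # fixed random labels C(x)

# The probe never sees raw tokens, only Encoder(x).
linguistic_probe = LogisticRegression(max_iter=1000).fit(encode(x), y_linguistic)
control_probe = LogisticRegression(max_iter=1000).fit(encode(x), y_control)
```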

Understanding intermediate layers using linear classifier probes.
https://arxiv.org/abs/1610.01644

The practical significance of control tasks

The selectivity of a probe puts linguistic task accuracy in context with the probe’s capacity to memorize from word types.
In effect, selectivity measures a probe's ability to discriminate how well the encoder captures linguistic structure; it could also be called discrimination power.
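Written as a formula, consistent with the numbers quoted in the list below:

\(\text{selectivity} = \text{acc}_{\text{linguistic task}} - \text{acc}_{\text{control task}}\)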

  1. With popular hyperparameter settings, MLP probes achieve very low selectivity, suggesting caution in interpreting how their results reflect properties of representations. For example, on part-of-speech tagging, 97.3 accuracy is achieved, compared to 92.8 control task accuracy, resulting in 4.5 selectivity.
  2. Linear and bilinear probes achieve relatively high selectivity across a range of hyperparameters. For example, a linear probe on part-of-speech tagging achieves a similar 97.2 accuracy, and 71.2 control task accuracy, for 26.0 selectivity. This suggests that the small accuracy gain of the MLP may be explained by increased probe expressivity (model complexity).
  3. The most popular method for controlling probe complexity, dropout, does not consistently lead to selective MLP probes. However, controlling MLP complexity through unintuitively small (10-dimensional) hidden states, as well as small training sample sizes and weight decay (all of which can also be viewed as ways to prevent overfitting), leads to higher selectivity and similar linguistic task accuracy (a sketch follows this list).
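As a sketch of finding 3, this is how those complexity controls typically appear in a PyTorch MLP probe. Apart from the 10-dimensional hidden state highlighted by the paper, the dimensions and coefficients here are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

REPR_DIM, NUM_TAGS = 1024, 45  # e.g. ELMo layer width and PTB tag count

# MLP probe whose capacity is deliberately constrained, per finding 3:
probe = nn.Sequential(
    nn.Linear(REPR_DIM, 10),  # unintuitively small hidden state limits memorization
    nn.ReLU(),
    nn.Dropout(p=0.5),        # dropout alone did not reliably improve selectivity
    nn.Linear(10, NUM_TAGS),
)

# Weight decay (L2 regularization) is applied via the optimizer.
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3, weight_decay=1e-4)
```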

  1. Hewitt and Liang, “Designing and Interpreting Probes with Control Tasks”, EMNLP 2019. ↩︎

posted @ 2021-11-10 17:28  ZXYFrank