(3) DP-SGD Algorithm Explained

We are starting a series of blog posts on DP-SGD that will range from gentle introductions to detailed coverage of the math and the engineering details that make it work.

Introduction

In this first entry, we will go over the DP-SGD algorithm focusing on introducing the reader to the core concepts, without worrying about mathematical precision or implementation just yet (they will be covered in future episodes). The intended audience for this article is someone who has experience with training ML models, including deep nets via backpropagation using their favorite framework (PyTorch, of course 🙂).

Privacy in ML

We know that training a model is an attempt at induction: we learn something from our data and we plan to use it to predict something else in the future. To state the obvious plainly, this means that there is some information in our dataset, and by training a model we condense at least some of it into an artifact we plan to use later. We learn in Machine Learning 101 that memorization can happen, so it’s perhaps not surprising that memorization can indeed be exploited to extract information about training data from a model (see e.g. [Carlini et al, 2018], [Feldman 2020]).

What is privacy, anyway? Let’s say we don’t know and let’s start fresh by looking at our problem. We can all agree that if the ML model has never seen your data in the first place, it cannot possibly violate your privacy. Let’s call this our baseline scenario. Intuitively, if you now added your own data to the training set and the resulting model changed a lot compared to the baseline, you would be concerned about your privacy. While this makes intuitive sense, the real world is more complex than this. In particular, there are two problems we should think about:

  1. We know that any tweak in the training process, no matter how trivial, will change the resulting model significantly. Permuting the training data, rerandomizing initial parameters, or running another task on the same GPU will produce a different model with potentially very different weights. This means that we can’t simply measure how different the weights are in these two scenarios as that will never work.
  2. If everyone expected that absolutely no change would happen in a model if they added their data, it means that there would be no training data and hence no ML models! We can see that this constraint is a bit too rigid.

Luckily for us, this was figured out by [Dwork et al, 2006] and the resulting concept of differential privacy provides a solution to both problems! For the first, rather than comparing the weights of the two models, we want to consider the probabilities of observing these weights. For the second, instead of insisting that nothing will change, let’s instead promise that while something will change, we guarantee it will never change by more than a specific and predefined amount. This way, we never learn enough to be nosy, but we can still learn enough to produce useful models.

These two principles are embodied in the definition of differential privacy which goes as follows. Imagine that you have two datasets D and D′ that differ in only a single record (e.g., my data) and you interact with the data via a process or mechanism called M (this can be anything, more on this later). We can say that M is ε-differentially private if for every possible output x, the probability that this output is observed never differs by more than exp(ε) between the two scenarios (with and without my data).

Or, if you prefer a formula:

∀ D and D′ that differ in one person’s data, ∀ x: ℙ[M(D) = x] ≤ exp(ε) ⋅ ℙ[M(D′) = x]

One of the amazing things about differential privacy is that it imposes no limitations on the nature of M: it can be anything. It can be a database query, it can be asking a set of questions with pen and paper to a person, or even just storing the data to disk or sending it over the wire, or anything else you want. As long as M enjoys this property over its outputs, then it can claim its DP badge for a specific privacy budget ε. At the same time, you can choose what ε you want to be certified for: the higher it is, the less private you are (look at the formula: it means the probabilities are allowed to diverge more). For this reason, the quantity ε is commonly referred to as the privacy [loss] budget.

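To put numbers on this: with ε = 0.1, the two probabilities can differ by at most a factor of exp(0.1) ≈ 1.1, so adding or removing my record barely moves the needle; with ε = 10, the allowed factor is exp(10) ≈ 22,000, which is technically still differential privacy but a much weaker promise.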

If we go back to our case of training a model, we now have a way to formally certify the privacy level of our training algorithm: if we train two models, one on all the data (mine included) and the other on all the data except mine, and we can prove that all weights of the two models are observed with probabilities that lie within a predefined bound of exp(ε) of each other, then we can claim the cool DP badge for our training procedure (that’s right! It’s the overall process that gets the badge, not the data and certainly not the trained model!).

Notice that this task is harder than it looks: we can’t simply try 1000 examples (or a million, or a billion) and check whether they match. We need to prove this for all values, including ones never previously observed. The only way out of this is math and theorems. The good news is that if we do somehow manage to get this done, then we know that, no matter what, the privacy claim is always true. There can never be any future attack that will extract our precious data from a trained model, nor any bugs to exploit to circumvent our defense, just as you can’t break Pythagoras’s theorem. This is why it’s worth doing.

Providing a guarantee

So, how do we provide this guarantee then? The definition doesn’t say anything about the how.

It’s helpful to think about this problem on a simpler domain, so for now let us leave machine learning aside and focus on making private counting queries to a database — at the end of the day, we can see ML training as a special case of querying the training data to get numerical results out.

It is trivial to see that COUNT(*) WHERE <cond> queries can lead to a complete privacy breakdown against a sufficiently determined attacker. Consider the following example of a database that consists of two fields, name and salary, with the latter being kept “private” by mandating it can only be shown in aggregates. By repeatedly running queries such as COUNT(*) WHERE name="Alice" AND salary < X, Alice’s salary can be recovered with binary search. Can we defend against this attack by disallowing queries that target individuals? If only! A pair of queries COUNT(*) WHERE salary < X and COUNT(*) WHERE name<>"Alice" AND salary < X get the job done just as easily as before.

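To make the attack concrete, here is a minimal sketch of that binary search. The helper count_below is hypothetical and stands in for whatever query interface the database exposes; it is assumed to return the exact answer to COUNT(*) WHERE name="Alice" AND salary < x.

def recover_salary(count_below, low=0, high=1_000_000):
    # Invariant: low <= Alice's salary <= high
    while low < high:
        mid = (low + high) // 2
        if count_below(mid + 1) == 1:  # Alice's salary <= mid
            high = mid
        else:                          # Alice's salary > mid
            low = mid + 1
    return low  # about log2(high - low) exact answers suffice

The same search works without ever naming Alice: subtracting COUNT(*) WHERE name<>"Alice" AND salary < x from COUNT(*) WHERE salary < x yields exactly the same indicator.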

It may seem that these simple attacks can be thwarted by making the server’s answers a bit less precise. For instance, what if the server rounds its responses to the closest multiple of 10? Or, to confuse the attacker even more, chooses the rounding direction randomly?

A seminal result from the early 2000s due to Irit Dinur and Kobbi Nissim states, loosely, that too many accurate answers to too many questions will violate privacy almost surely. This phenomenon is known in the literature as the Fundamental Law of Information Recovery and has been demonstrated in practice in a variety of contexts time and time again. It effectively means that not only can the answers not be overly precise, but the error must also grow with the number of answers if we want to avoid a nearly total reconstruction of the dataset.

The notion of differential privacy turns these observations into actionable guidance.

The remarkable fact is that we can enforce differential privacy for counting queries by simply computing the precise answer and adding noise randomly sampled from a carefully chosen probability distribution. In its simplest form, a privacy-preserving mechanism can be implemented with noise drawn from the Laplace distribution.
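
As a small illustrative sketch (not code from the original post): a counting query has sensitivity 1, since adding or removing one person changes the count by at most 1, so adding Laplace noise with scale 1/ε gives an ε-differentially private answer. private_count is a hypothetical helper, not a library function.

import numpy as np

def private_count(true_count, epsilon):
    # Laplace mechanism for a counting query: sensitivity is 1,
    # so noise with scale 1/epsilon yields epsilon-DP
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)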

Of course, by asking the same query multiple times, the additive noise will average out and the true answer will emerge, which is exactly what Dinur-Nissim warned us about. Take that, differential privacy!

Differential privacy allows us to analyze this effect too, and in a very neat way: if you take a measurement from a mechanism with privacy budget ε₁ and a second measurement from another mechanism with privacy budget ε₂, then the total privacy budget will simply be ε₁ + ε₂. Sleek, eh? This property is called (simple) composition. This means that if the mechanism guarantees that a single query has ε=1 and you want to issue three queries, the total privacy budget expended will be ε=3.
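
The averaging attack and composition are two sides of the same bookkeeping. Continuing the hypothetical private_count sketch above: repeating a query does sharpen the estimate, but every repetition spends more of the budget.

epsilon_per_query = 1.0
answers = [private_count(42, epsilon_per_query) for _ in range(100)]

estimate = sum(answers) / len(answers)            # noise averages out, estimate is close to 42
total_epsilon = epsilon_per_query * len(answers)  # simple composition: 100 * 1.0 = 100

A total budget of ε=100 is, for all practical purposes, no privacy at all, which is exactly the Dinur-Nissim message.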

This “just add some noise” business sounds too good to be true, right? What if the attacker thinks really hard about the output of a differentially private computation, such as feeding it into a custom-made neural network trained to break privacy? Fear not! Differential privacy is preserved by post-processing, which means that results of running arbitrary computations on top of differentially private output won’t roll back the ε. Not only does it protect against clever adversaries, it gives us a lot of flexibility in designing differentially private mechanisms: once differential privacy is enforced anywhere in the data processing pipeline, the final output will satisfy differential privacy.

To recap, we learned that our solution will look like this:

  1. Our mechanism will be randomized, i.e., it will use noise.
  2. Our final privacy claim depends on the total number of interactions with the data.
  3. We can post-process results of a differentially private computation any way we want (as long as we don’t peek into the private dataset again).

Back to machine learning

To apply the concept of differential privacy to the original domain of machine learning, we need to land on two decisions: how we define “one person’s data” that separates D from D’ and what the mechanism M is.

Since in most applications of ML the inputs come without explicit user identifiers, with Federated Learning being one notable exception, we will default to protecting privacy of a single sample in the training dataset. We will discuss other options in future Medium posts.

As for the mechanism M, one possibility is to consider privacy of the model’s outputs only. This is indeed a valid option called private prediction, but it comes with many strings attached: the model itself can still memorize training data, so it’s up to your inference system to enforce the privacy constraints securely. Also, this prevents us from ever releasing our ML model: if someone gets to see the weights, our privacy guarantees will be lost. This means, among other things, that deploying on mobile will be considerably less safe.

For this reason, it would be much preferable if we could instead insert the DP mechanism during model training, so that the resulting model could be safe for release. This brings us to the DP-SGD algorithm. (There is evidence that even when you only care about accuracy, private training still beats private prediction. See [van der Maaten, Hannun 2020] for a practical analysis and more discussion on the topic).

DP-SGD

DP-SGD (Differentially-Private Stochastic Gradient Descent) modifies the minibatch stochastic optimization process that is so popular with deep learning in order to make it differentially private.

The core idea is that training a model in PyTorch can be done through access to its parameter gradients, i.e., the gradients of the loss with respect to each parameter of your model. If this access preserves differential privacy of the training data, so does the resulting model, per the post-processing property of differential privacy.

There is also an engineering angle here: since the PyTorch optimizer already looks at parameter gradients, we can add this noise business directly into it and hide away the complexity, allowing anyone to simply train a differentially private model. Profit!

This code sample shows how simple this is:

import torch
from torch.utils.data import DataLoader

# model, criterion, train_dataset and args are assumed to be defined elsewhere
optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

for batch in DataLoader(train_dataset, batch_size=32):
    x, y = batch
    y_hat = model(x)
    loss = criterion(y_hat, y)
    loss.backward()

    # Now the parameter gradients are filled:
    gradients = (p.grad for p in model.parameters())

    for p in model.parameters():

        # Add our differential privacy magic here
        p.grad += torch.normal(mean=0.0, std=args.sigma, size=p.grad.shape)

        # This is what optimizer.step() does
        p.data -= args.lr * p.grad
        p.grad.zero_()

We have only one question left: how much noise should we be adding? Too little and we can’t respect privacy, too much and we are left with a private but useless model. This turns out to be more than a minor issue. Our ambition is to guarantee that we respect the privacy of each and every sample, not of every batch (since these aren’t a meaningful unit privacy-wise). We’ll cover the details in a future installment of this series, but the intuition is very straightforward: the right answer depends on the largest norm of the gradient in a minibatch, as that is the sample that is at most risk of exposure.

We need to add just enough noise to hide the largest possible gradient so that we can guarantee that we respect the privacy of each and every sample in that batch. To this end, we use the Gaussian mechanism, which takes in two parameters: the noise multiplier and the bound on the gradient norm. But wait… the gradients that arise during training of a deep neural network are potentially unbounded. In fact, for outliers and mislabeled inputs they can be very large indeed. What gives?

If the gradients are not bounded, we’ll make them so ourselves! Let C be the target bound for the maximum gradient norm. For each sample in the batch, we compute its parameter gradient and, if its norm is larger than C, we clip the gradient by scaling it down to C. Mission accomplished: all the gradients are now guaranteed to have a norm bounded by C, which we naturally call the clipping threshold. Intuitively, this means that we prevent the model from learning more than a set amount of information from any given training sample, no matter how different it is from the rest.
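
As a minimal sketch of this calibration (clip_to_norm and add_gaussian_noise are hypothetical helpers for illustration, not functions from any library): clip a per-sample gradient so its norm is at most C, then add Gaussian noise whose standard deviation is the noise multiplier times C.

import torch

def clip_to_norm(per_sample_grad, max_norm):
    # Scale the (flattened) per-sample gradient down so its L2 norm is at most max_norm
    norm = per_sample_grad.norm(2)
    factor = (max_norm / (norm + 1e-6)).clamp(max=1.0)
    return per_sample_grad * factor

def add_gaussian_noise(summed_grad, max_norm, noise_multiplier):
    # Gaussian mechanism: noise proportional to the clipping threshold hides any single sample
    std = noise_multiplier * max_norm
    return summed_grad + torch.normal(mean=0.0, std=std, size=summed_grad.shape)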

This requires computing parameter gradients for each sample in a batch. We normally refer to them as per-sample gradients. Let’s spend a little more time here, as these are a quantity that is normally not computed: usually, we process data in batches (in the code snippet above, the batch size is 32). The parameter gradients we have in p.grad are the average of the gradients over the examples in the batch, which is not what we want: we want 32 different gradient tensors, not their average collapsed into a single one.

In code:

optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

for batch in DataLoader(train_dataset, batch_size=32):
    all_per_sample_gradients = []  # will have len = batch_size
    for sample in batch:
        x, y = sample
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()  # Now p.grad for this sample is filled

        # Need to clone the gradients to save them
        per_sample_gradients = [p.grad.detach().clone() for p in model.parameters()]

        all_per_sample_gradients.append(per_sample_gradients)
        model.zero_grad()  # p.grad is cumulative, so reset it before the next sample

Computing per-sample gradients like in the snippet above seems slow, and it is: it forces us to run the backward step for one example at a time, thus losing the benefit of parallelization. There is no standard way around this, as once we look into p.grad, the per-sample information will have already been lost. It is however at least correct: a batch gradient is a per-sample gradient if batch_size=1. This method is called the microbatch method and it offers simplicity and universal compatibility (every possible layer is automatically supported) at the cost of training speed. Our library, Opacus, uses a different method that is much faster, at the cost of some extra engineering work. We will cover this method in depth in a follow-up Medium post. For now, let’s stick to microbatching.

Opacus (https://opacus.ai/) is a library that enables training PyTorch models with differential privacy

Putting it all together, we want to:

  1. Compute the per-sample gradients
  2. Clip them to a fixed maximum norm
  3. Aggregate them back into a single parameter gradient
  4. Add noise to it

Here’s some sample code to do just that:

import torch
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import DataLoader

optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

for batch in DataLoader(train_dataset, batch_size=32):
    for param in model.parameters():
        param.accumulated_grads = []

    # Run the microbatches
    for sample in batch:
        x, y = sample
        y_hat = model(x)
        loss = criterion(y_hat, y)
        loss.backward()

        # Clip this sample's gradient so its total norm is at most max_grad_norm
        # (clip_grad_norm_ rescales every p.grad in-place)
        clip_grad_norm_(model.parameters(), max_norm=args.max_grad_norm)

        for param in model.parameters():
            param.accumulated_grads.append(param.grad.detach().clone())

        model.zero_grad()  # reset p.grad before the next per-sample backward pass

    # Aggregate back into a single gradient per parameter
    for param in model.parameters():
        param.grad = torch.stack(param.accumulated_grads, dim=0).sum(dim=0)

    # Now we are ready to add noise and update!
    for param in model.parameters():
        # Gaussian mechanism: noise scaled to the clipping threshold
        param.grad += torch.normal(
            mean=0.0,
            std=args.noise_multiplier * args.max_grad_norm,
            size=param.grad.shape,
        )

        # This is what optimizer.step() does
        param.data -= args.lr * param.grad
        param.grad = None  # Reset for next iteration

This already gives a good idea of how to implement the DP-SGD algorithm, although this is clearly suboptimal and (as we shall see) not fully secure. In future Medium posts, we will cover how we bring back parallelization to DP-SGD, add support for cryptographically secure randomness, analyze the algorithm’s differential privacy, and finally train some models. Stay tuned!

To learn more about Opacus, visit opacus.ai and github.com/pytorch/opacus.

https://medium.com/pytorch/differential-privacy-series-part-1-dp-sgd-algorithm-explained-12512c3959a3
