
How the backpropagation algorithm works

2016-03-20 12:56  GarfieldEr007

In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn't discuss how to compute the gradient of the cost function. That's quite a gap! In this chapter I'll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?

The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative $\partial C/\partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.

With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.

 

Warm up: a fast matrix-based approach to computing the output from a neural network

 

Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.

Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

This notation is cumbersome at first, and it does take some work to master. But with a little effort you'll find the notation becomes easy and natural. One quirk of the notation is the ordering of the $j$ and $k$ indices. You might think that it makes more sense to use $j$ to refer to the input neuron, and $k$ to the output neuron, not vice versa, as is actually done. I'll explain the reason for this quirk below.

 

We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram shows examples of these notations in use:

With these notations, the activation $a^l_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)

$$a^l_j = \sigma\Bigl(\sum_k w^l_{jk} a^{l-1}_k + b^l_j\Bigr), \tag{23}$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons, that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.

 

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$. We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect

$$f\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = \begin{bmatrix} f(2) \\ f(3) \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \end{bmatrix}, \tag{24}$$

that is, the vectorized ff just squares every element of the vector.
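In NumPy (the library used by the code later in this chapter), this kind of elementwise application comes for free, since arithmetic operators act componentwise on arrays. A minimal sketch of the example above:

```python
import numpy as np

def f(x):
    """x**2 acts elementwise when x is a NumPy array."""
    return x**2

v = np.array([2.0, 3.0])
print(f(v))  # prints [4. 9.]
```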

 

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form

$$a^l = \sigma(w^l a^{l-1} + b^l). \tag{25}$$

This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function* (*By the way, it's this expression that motivates the quirk in the $w^l_{jk}$ notation mentioned earlier. If we used $j$ to index the input neuron, and $k$ to index the output neuron, then we'd need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That's a small change, but annoying, and we'd lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".). That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken up to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.
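As a concrete sketch of Equation (25), here's a single layer of feedforward in NumPy. The weights, biases, and activations below are made-up illustration values; only the shapes matter:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer: 3 neurons in layer l-1 feeding 2 neurons in layer l.
w = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])      # entry w[j][k] is the weight w^l_{jk}
b = np.array([0.1, -0.1])            # one bias per neuron in layer l
a_prev = np.array([1.0, 0.0, 1.0])   # activations a^{l-1}

# Equation (25): apply the weight matrix, add the bias vector, apply sigma.
z = np.dot(w, a_prev) + b
a = sigmoid(z)
print(a)
```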

 

When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1} + b^l$ along the way. This quantity turns out to be useful enough to be worth naming: we call $z^l$ the weighted input to the neurons in layer $l$. We'll make considerable use of the weighted input $z^l$ later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$, that is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.

 

The two assumptions we need about the cost function

 

The goal of backpropagation is to compute the partial derivatives $\partial C/\partial w$ and $\partial C/\partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it's useful to have an example cost function in mind. We'll use the quadratic cost function from the last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form

$$C = \frac{1}{2n} \sum_x \|y(x) - a^L(x)\|^2, \tag{26}$$

where: $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.

 

Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \|y - a^L\|^2$. This assumption will also hold true for all the other cost functions we'll meet in this book.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x/\partial w$ and $\partial C_x/\partial b$ for a single training example. We then recover $\partial C/\partial w$ and $\partial C/\partial b$ by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network, $C = C(a^L)$.

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as

$$C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2, \tag{27}$$

and thus is a function of the output activations. Of course, this cost function also depends on the desired output $y$, and you may wonder why we're not regarding the cost also as a function of $y$. Remember, though, that the input training example $x$ is fixed, and so the output $y$ is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns. And so it makes sense to regard $C$ as a function of the output activations $a^L$ alone, with $y$ merely a parameter that helps define that function.

 

 

 

 

 

The Hadamard product, $s \odot t$

 

The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,

$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 * 3 \\ 2 * 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix}. \tag{28}$$

This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We'll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
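In NumPy, for instance, the Hadamard product is simply the `*` operator on two arrays of the same shape, so Equation (28) is a one-liner:

```python
import numpy as np

s = np.array([1.0, 2.0])
t = np.array([3.0, 4.0])
print(s * t)  # prints [3. 8.]
```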

 

 

The four fundamental equations behind backpropagation

 

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$.

To understand how the error is defined, imagine there is a demon in our neural network:

The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs $\sigma(z^l_j + \Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

 

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$. So far as the demon can tell, the neuron is already pretty near optimal* (*This is only the case for small changes $\Delta z^l_j$, of course. We'll assume that the demon is constrained to make such small changes.). And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.

Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}$$

As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$. Backpropagation will give us a way of computing $\delta^l$ for every layer, and then relating those errors to the quantities of real interest, $\partial C/\partial w^l_{jk}$ and $\partial C/\partial b^l_j$.

 

You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error* (*In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our $\delta$ vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage.).

Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error δlδl and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}$$

This is a very natural expression. The first term on the right, $\partial C/\partial a^L_j$, just measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation. If, for example, $C$ doesn't depend much on a particular output neuron, $j$, then $\delta^L_j$ will be small, which is what we'd expect. The second term on the right, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.

 

Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C/\partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing $\partial C/\partial a^L_j$. For example, if we're using the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j - a_j)^2$, and so $\partial C/\partial a^L_j = (a_j - y_j)$, which obviously is easily computable.

Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as

$$\delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}$$

Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C/\partial a^L_j$. You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations. It's easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L - y)$, and so the fully matrix-based form of (BP1) becomes

$$\delta^L = (a^L - y) \odot \sigma'(z^L). \tag{30}$$

As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.
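For instance, here's a sketch of Equation (30) in NumPy, with made-up weighted inputs and desired output (the particular numbers are arbitrary, chosen only to exercise the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    return sigmoid(z) * (1.0 - sigmoid(z))

z_L = np.array([0.5, -1.0])   # weighted inputs to the output layer
a_L = sigmoid(z_L)            # output activations
y = np.array([1.0, 0.0])      # desired output

# Equation (30): delta^L = (a^L - y) ⊙ sigma'(z^L)
delta_L = (a_L - y) * sigmoid_prime(z_L)
print(delta_L)
```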

 

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}$$

where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{\rm th}$ layer. This equation appears complicated, but each element has a nice interpretation. Suppose we know the error $\delta^{l+1}$ at the $(l+1)^{\rm th}$ layer. When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer. We then take the Hadamard product $\odot\, \sigma'(z^l)$. This moves the error backward through the activation function in layer $l$, giving us the error $\delta^l$ in the weighted input to layer $l$.

 

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.

An equation for the rate of change of the cost with respect to any bias in the network: In particular:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}$$

That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial C/\partial b^l_j$. This is great news, since (BP1) and (BP2) have already told us how to compute $\delta^l_j$. We can rewrite (BP3) in shorthand as

$$\frac{\partial C}{\partial b} = \delta, \tag{31}$$

where it is understood that $\delta$ is being evaluated at the same neuron as the bias $b$.

 

An equation for the rate of change of the cost with respect to any weight in the network: In particular:

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}$$

This tells us how to compute the partial derivatives $\partial C/\partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute. The equation can be rewritten in a less index-heavy notation as

$$\frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out}, \tag{32}$$

where it's understood that $a_{\rm in}$ is the activation of the neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of the neuron output from the weight $w$. Zooming in to look at just the weight $w$, and the two neurons connected by that weight, we can depict this as:

 

A nice consequence of Equation (32) is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx 0$, the gradient term $\partial C/\partial w$ will also tend to be small. In this case, we'll say the weight learns slowly, meaning that it's not changing much during gradient descent. In other words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.
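For a whole layer at once, (BP4) amounts to the outer product of $\delta^l$ with $(a^{l-1})^T$, which is how the code later in the chapter computes `nabla_w`. A small sketch with arbitrary illustration values:

```python
import numpy as np

delta = np.array([[0.5], [-0.2]])         # delta^l, column vector (2 neurons in layer l)
a_prev = np.array([[1.0], [0.0], [2.0]])  # a^{l-1}, column vector (3 neurons in layer l-1)

# dC/dw^l_{jk} = a^{l-1}_k * delta^l_j, for all j and k at once:
nabla_w = np.dot(delta, a_prev.transpose())
print(nabla_w.shape)  # prints (2, 3), the same shape as the weight matrix w^l
```

Note how the zero activation in `a_prev` produces a zero column of gradient: the weights out of that neuron learn slowly, exactly as the text says.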

 

There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall from the graph of the sigmoid function in the last chapter that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation ($\approx 0$) or high activation ($\approx 1$). In this case it's common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold also for the biases of the output neurons.

We can obtain similar insights for earlier layers. In particular, note the $\sigma'(z^l)$ term in (BP2). This means that $\delta^l_j$ is likely to get small if the neuron is near saturation. And this, in turn, means that any weights input to a saturated neuron will learn slowly* (*This reasoning won't hold if $(w^{l+1})^T \delta^{l+1}$ has large enough entries to compensate for the smallness of $\sigma'(z^l_j)$. But I'm speaking of the general tendency.).

Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.

None of these observations is too greatly surprising. Still, they help improve our mental model of what's going on as a neural network learns. Furthermore, we can turn this type of reasoning around. The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function (that's because, as we'll see in a moment, the proofs don't use any special properties of $\sigma$). And so we can use these equations to design activation functions which have particular desired learning properties. As an example to give you the idea, suppose we were to choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$ is always positive, and never gets close to zero. That would prevent the slow-down of learning that occurs when ordinary sigmoid neurons saturate. Later in the book we'll see examples where this kind of modification is made to the activation function. Keeping the four equations (BP1)-(BP4) in mind can help explain why such modifications are tried, and what impact they can have.

 

 

 

 

Problem

  • Alternate presentation of the equations of backpropagation: I've stated the equations of backpropagation (notably (BP1) and (BP2)) using the Hadamard product. This presentation may be disconcerting if you're unused to the Hadamard product. There's an alternative approach, based on conventional matrix multiplication, which some readers may find enlightening. (1) Show that (BP1) may be rewritten as
    $$\delta^L = \Sigma'(z^L) \nabla_a C, \tag{33}$$
    where $\Sigma'(z^L)$ is a square matrix whose diagonal entries are the values $\sigma'(z^L_j)$, and whose off-diagonal entries are zero. Note that this matrix acts on $\nabla_a C$ by conventional matrix multiplication. (2) Show that (BP2) may be rewritten as
    $$\delta^l = \Sigma'(z^l) (w^{l+1})^T \delta^{l+1}. \tag{34}$$
    (3) By combining observations (1) and (2) show that
    $$\delta^l = \Sigma'(z^l) (w^{l+1})^T \ldots \Sigma'(z^{L-1}) (w^L)^T \Sigma'(z^L) \nabla_a C \tag{35}$$
    For readers comfortable with matrix multiplication this equation may be easier to understand than (BP1) and (BP2). The reason I've focused on (BP1) and (BP2) is because that approach turns out to be faster to implement numerically.

 

 

Proof of the four fundamental equations (optional)

 

We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.

Let's begin with Equation (BP1), which gives an expression for the output error, δLδL. To prove this equation, recall that by definition

$$\delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}$$

Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations,

$$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}$$

where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k/\partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}$$

Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}$$

which is just (BP1), in component form.

 

Next, we'll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C/\partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C/\partial z^{l+1}_k$. We can do this using the chain rule,

$$\begin{aligned} \delta^l_j &= \frac{\partial C}{\partial z^l_j} && (40) \\ &= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} && (41) \\ &= \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, && (42) \end{aligned}$$

where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of δl+1kδkl+1. To evaluate the first term on the last line, note that

$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k. \tag{43}$$

Differentiating, we obtain

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}$$

Substituting back into (42) we obtain

$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}$$

This is just (BP2) written in component form.

 

The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.

 

Exercise

  • Prove Equations (BP3) and (BP4).

 

That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.
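Because backpropagation is nothing more than the chain rule, any derivative it produces can be sanity-checked against a numerical finite difference, $\partial C/\partial w \approx (C(w+\epsilon) - C(w-\epsilon))/2\epsilon$. Here's a sketch for a single sigmoid neuron with quadratic cost; the values of $w$, $x$, $b$, $y$ are arbitrary illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x=0.7, b=0.1, y=1.0):
    """Quadratic cost of one sigmoid neuron on one training example."""
    return 0.5 * (y - sigmoid(w * x + b))**2

w, x, b, y = 0.4, 0.7, 0.1, 1.0

# Chain-rule (backprop-style) derivative: dC/dw = (a - y) * sigma'(z) * x
a = sigmoid(w * x + b)
grad_bp = (a - y) * a * (1.0 - a) * x

# Central finite difference for comparison
eps = 1e-6
grad_fd = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(grad_bp, grad_fd)  # the two should agree to many decimal places
```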

 

The backpropagation algorithm

 

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

  1. Input $x$: Set the corresponding activation $a^1$ for the input layer.

     

  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.

     

  3. Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

     

  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$.

     

  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

 

Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.

 

Exercises

  • Backpropagation with a single modified neuron: Suppose we modify a single neuron in a feedforward network so that the output from the neuron is given by $f(\sum_j w_j x_j + b)$, where $f$ is some function other than the sigmoid. How should we modify the backpropagation algorithm in this case?

     

  • Backpropagation with linear neurons: Suppose we replace the usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout the network. Rewrite the backpropagation algorithm for this case.

 

As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples

     

     

  2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps:

     

     

    • Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1} + b^l$ and $a^{x,l} = \sigma(z^{x,l})$.

       

    • Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.

       

    • Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.
  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l - \frac{\eta}{m} \sum_x \delta^{x,l}$.

     

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I've omitted those for simplicity.

 

 

 

The code for backpropagation

 

Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses the backprop method to figure out the partial derivatives $\partial C_x/\partial b^l_j$ and $\partial C_x/\partial w^l_{jk}$. The backprop method follows the algorithm in the last section closely. There is one small change - we use a slightly different approach to indexing the layers. This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., l[-3] is the third last entry in a list l. The code for backprop is below, together with a few helper functions, which are used to compute the $\sigma$ function, the derivative $\sigma'$, and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way. If something's tripping you up, you may find it helpful to consult the original description (and complete listing) of the code.

class Network(object):
    ...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    ...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y) 

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
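As a sanity check on the listing above, here is a minimal standalone sketch (not part of network.py; the toy 2-3-1 network and random data are made up for illustration) that runs the same feedforward and backward steps as free functions, then compares one backprop derivative against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z): return sigmoid(z) * (1 - sigmoid(z))

# Tiny 2-3-1 network with random parameters and a quadratic cost.
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
biases = [rng.standard_normal((n, 1)) for n in sizes[1:]]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((2, 1))
y = np.array([[0.5]])

def quad_cost(ws):
    """C_x = ||a - y||^2 / 2 for the network with weight list ws."""
    a = x
    for b, w in zip(biases, ws):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(x, y):
    # Same logic as Network.backprop above, written as a free function.
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = delta @ activations[-2].T
    for l in range(2, len(sizes)):
        delta = (weights[-l+1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l-1].T
    return nabla_b, nabla_w

nabla_b, nabla_w = backprop(x, y)

# Finite-difference check on one first-layer weight.
eps = 1e-6
ws = [w.copy() for w in weights]
ws[0][0, 0] += eps
numeric = (quad_cost(ws) - quad_cost(weights)) / eps
print(nabla_w[0][0, 0], numeric)   # should agree closely
```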

Problem

  • Fully matrix-based approach to backpropagation over a mini-batch Our implementation of stochastic gradient descent loops over training examples in a mini-batch. It's possible to modify the backpropagation algorithm so that it computes the gradients for all training examples in a mini-batch simultaneously. The idea is that instead of beginning with a single input vector, $x$, we can begin with a matrix $X = [x_1\, x_2\, \ldots\, x_m]$ whose columns are the vectors in the mini-batch. We forward-propagate by multiplying by the weight matrices, adding a suitable matrix for the bias terms, and applying the sigmoid function everywhere. We backpropagate along similar lines. Explicitly write out pseudocode for this approach to the backpropagation algorithm. Modify network.py so that it uses this fully matrix-based approach. The advantage of this approach is that it takes full advantage of modern libraries for linear algebra. As a result it can be quite a bit faster than looping over the mini-batch. (On my laptop, for example, the speedup is about a factor of two when run on MNIST classification problems like those we considered in the last chapter.) In practice, all serious libraries for backpropagation use this fully matrix-based approach or some variant.
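To give a feel for the matrix-based bookkeeping (without giving the whole solution away), here is a sketch of just the forward pass for a whole mini-batch at once; the function name feedforward_matrix and the toy shapes are hypothetical, not part of network.py:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_matrix(weights, biases, X):
    """Feedforward for a whole mini-batch at once.
    X has shape (n_in, m): each COLUMN is one training input."""
    A = X
    for W, b in zip(weights, biases):
        # b has shape (n_out, 1); broadcasting adds it to every column.
        A = sigmoid(np.dot(W, A) + b)
    return A

rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
X = rng.standard_normal((2, 5))         # mini-batch of m = 5 inputs
out = feedforward_matrix(weights, biases, X)
print(out.shape)                        # (1, 5): one output per column
```

The backward pass can be written the same way, with the delta for the whole mini-batch held as a matrix whose columns are the per-example deltas.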


In what sense is backpropagation a fast algorithm?


In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment). You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C/\partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation

\[ \frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}, \tag{46} \]

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j$th direction. In other words, we can estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (46). The same idea will let us compute the partial derivatives $\partial C/\partial b$ with respect to the biases.
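As a concrete illustration, here is a minimal sketch of this finite-difference estimate; the function numerical_gradient and the toy quadratic cost below are hypothetical stand-ins, not part of the book's network.py:

```python
import numpy as np

def numerical_gradient(cost, w, eps=1e-6):
    """Estimate the gradient of cost at w using Equation (46):
    one extra cost evaluation per weight."""
    grad = np.zeros_like(w)
    base = cost(w)                     # C(w), computed once
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += eps               # w + eps * e_j
        grad[j] = (cost(w_step) - base) / eps
    return grad

# Toy cost: C(w) = sum of squares, so the exact gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(lambda v: np.sum(v**2), w)
print(approx)   # close to [2., -4., 6.]
```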


This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C/\partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.

What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C/\partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass. (This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.) And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it's actually much, much faster.

This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.


Backpropagation: the big picture


As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place. It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.

To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:

That change in weight will cause a change in the output activation from the corresponding neuron:

That, in turn, will cause a change in all the activations in the next layer:

Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function:

The change $\Delta C$ in the cost is related to the change $\Delta w^l_{jk}$ in the weight by the equation

\[ \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47} \]

This suggests that a possible approach to computing $\partial C/\partial w^l_{jk}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C/\partial w^l_{jk}$.


Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^l_j$ in the activation of the $j$th neuron in the $l$th layer. This change is given by

\[ \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48} \]

The change in activation $\Delta a^l_j$ will cause changes in all the activations in the next layer, i.e., the $(l+1)$th layer. We'll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$,


In fact, it'll cause the following change:

\[ \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j. \tag{49} \]

Substituting in the expression from Equation (48), we get:

\[ \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{50} \]

Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from $w^l_{jk}$ to $C$, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the resulting expression is

\[ \Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{51} \]

that is, we've picked up a $\partial a/\partial a$ type term for each additional neuron we've passed through, as well as the $\partial C/\partial a^L_m$ term at the end. This represents the change in $C$ due to changes in the activations along this particular path through the network. Of course, there are many paths by which a change in $w^l_{jk}$ can propagate to affect the cost, and we've been considering just a single path. To compute the total change in $C$ it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e.,

\[ \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{52} \]

where we've summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that

\[ \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \tag{53} \]

Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We're computing the rate of change of $C$ with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The edge from the first weight to the first neuron has a rate factor $\partial a^l_j/\partial w^l_{jk}$. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change $\partial C/\partial w^l_{jk}$ is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:


What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53). That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
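The path-sum formula (53) can actually be checked by brute force on a network small enough to enumerate the paths. Here is a sketch (the 1-2-2-1 network, weights, and data below are made-up toy values): we sum the product of rate factors over both paths from a first-layer weight to the cost, and compare against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_prime(z): return sigmoid(z) * (1 - sigmoid(z))

# Toy network: 1 input -> 2 hidden -> 2 hidden -> 1 output,
# quadratic cost C = (a3 - y)^2 / 2.
rng = np.random.default_rng(1)
w1, b1 = rng.standard_normal((2, 1)), rng.standard_normal(2)
w2, b2 = rng.standard_normal((2, 2)), rng.standard_normal(2)
w3, b3 = rng.standard_normal((1, 2)), rng.standard_normal(1)
x, y = 0.7, 0.3

def cost(w1_):
    a1 = sigmoid(w1_ @ np.array([x]) + b1)
    a2 = sigmoid(w2 @ a1 + b2)
    a3 = sigmoid(w3 @ a2 + b3)
    return 0.5 * (a3[0] - y) ** 2

# Forward pass, keeping the weighted inputs z for the rate factors.
z1 = w1 @ np.array([x]) + b1; a1 = sigmoid(z1)
z2 = w2 @ a1 + b2;            a2 = sigmoid(z2)
z3 = w3 @ a2 + b3;            a3 = sigmoid(z3)

# Sum over all paths from weight w1[j, 0] to the cost, as in (53).
# Each path passes through one second-hidden-layer neuron q.
j = 0
path_sum = 0.0
for q in range(2):
    rate = (a3[0] - y)                        # dC/da3
    rate *= sigmoid_prime(z3[0]) * w3[0, q]   # da3/da2_q
    rate *= sigmoid_prime(z2[q]) * w2[q, j]   # da2_q/da1_j
    rate *= sigmoid_prime(z1[j]) * x          # da1_j/dw1[j,0]
    path_sum += rate

# Check against a finite-difference estimate of dC/dw1[j,0].
eps = 1e-6
w1_step = w1.copy(); w1_step[j, 0] += eps
numeric = (cost(w1_step) - cost(w1)) / eps
print(path_sum, numeric)   # the two should agree closely
```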

Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.

What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.

*There is one clever step required. In Equation (53) the intermediate variables are activations like $a^{l+1}_q$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter.

from: http://neuralnetworksanddeeplearning.com/chap2.html