Machine Learning Week_5 Cost Function and BackPropagation

0 Neural Networks: Learning
1 Cost Function and BackPropagation
2 Backpropagation in Practice
3 Autonomous Driving
Neural Network-Based Autonomous Driving. 1992 11 23

As for the backpropagation algorithm, the formulas given by the teacher are really useful.

But formulas alone don't tell you why you are doing each step, or what delta means. The best way to find out is to work through a small neural network by hand, using the chain rule for derivatives: compute the derivative for each θ once, then put the results together to see how the vectorized implementation follows.

There is no hidden meaning. There are just the laws of arithmetic.

0 Neural Networks: Learning

In Week 5, you will be learning how to train Neural Networks. The Neural Network is one of the most powerful learning algorithms (when a linear classifier doesn't work, this is what I usually turn to), and this week's videos explain the 'backpropagation' algorithm for training these models. In this week's programming assignment, you'll also get to implement this algorithm and see it work for yourself.

The Neural Network programming exercise will be one of the more challenging ones of this class. So please start early and do leave extra time to get it done, and I hope you'll stick with it until you get it to work! As always, if you get stuck on the quiz and programming assignment, you should post on the Discussions to ask for help. (And if you finish early, I hope you'll go there to help your fellow classmates as well.) -- Andrew Ng

1 Cost Function and BackPropagation

1.1 Cost Function

Let's first define a few variables that we will need to use:

L = total number of layers in the network
$s_{l}$ = number of units (not counting bias unit) in layer l
K = number of output units/classes
Binary classification: y = 0 or y = 1, K=1;
Multi-class classification: K>=3;

h_{Θ} (x) \in \mathbb{R}^{K}, \quad y \in \mathbb{R}^{K}

Recall that in neural networks, we may have many output nodes. We denote $h_{Θ} (x)_{k}$ as being a hypothesis that results in the $k^{th}$ output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. Recall that the cost function for regularized logistic regression was:

J (θ) = - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} \log (h_{θ} (x^{(i)})) + (1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))] + \frac{λ}{2 m} \sum_{j = 1}^{n} θ_{j}^{2}

For neural networks, it is going to be slightly more complicated:

\begin{matrix} J (Θ) = - \frac{1}{m} \sum_{i = 1}^{m} \sum_{k = 1}^{K} [y_{k}^{(i)} \log ((h_{Θ} (x^{(i)}))_{k}) + (1 - y_{k}^{(i)}) \log (1 - (h_{Θ} (x^{(i)}))_{k})] + \frac{λ}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} (Θ_{j, i}^{(l)})^{2} \end{matrix}

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.

On the regularization term, the lecture notes do not quite match what the teacher says in the video, so I note the difference here.

Teacher
In the regularization part, concretely, we don't sum over the terms corresponding to i equal to 0. These are like the bias units, and by analogy to what we did for logistic regression, we leave them out of the regularization term because we don't want to regularize them and shrink their values toward zero. But this is just one possible convention; even if you were to sum over i from 0 up to $s_{l}$ , it would work about the same and not make a big difference. The convention of not regularizing the bias term is perhaps slightly more common. This corresponds to the formula above.

Lecture notes

\begin{matrix} J (Θ) = - \frac{1}{m} \sum_{i = 1}^{m} \sum_{k = 1}^{K} [y_{k}^{(i)} \log ((h_{Θ} (x^{(i)}))_{k}) + (1 - y_{k}^{(i)}) \log (1 - (h_{Θ} (x^{(i)}))_{k})] + \frac{λ}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 0}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} (Θ_{j, i}^{(l)})^{2} \end{matrix}

In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.

Note:

the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
the triple sum simply adds up the squares of all the individual Θs in the entire network.
the i in the triple sum does not refer to training example i
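
To make the double and triple sums concrete, here is a minimal Python sketch of J(Θ). It assumes sigmoid units and the convention of not regularizing the bias column. The helper names (`sigmoid`, `forward`, `nn_cost`) and the list-of-rows layout for each Θ matrix are assumptions of this sketch, not part of the course code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Return the output activations h_Theta(x) for one example x (a list of floats)."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit a_0 = 1
        a = [sigmoid(sum(w * v for w, v in zip(row, a_with_bias))) for row in theta]
    return a

def nn_cost(thetas, X, Y, lam):
    """Regularized cross-entropy cost; column 0 (bias) of each Theta is not regularized."""
    m = len(X)
    data_term = 0.0
    for x, y in zip(X, Y):              # sum over the m training examples
        h = forward(thetas, x)
        for h_k, y_k in zip(h, y):      # sum over the K output units
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # triple sum: every weight in every layer, skipping the bias column
    reg_term = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term
```

With all weights set to zero every output is 0.5, so for a single example with one output unit the unregularized cost should come out as log 2, which is a quick sanity check on an implementation.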

1.2 Backpropagation Algorithm

"Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute:

\min_{Θ} J (Θ)

That is, we want to minimize our cost function J using an optimal set of parameters in theta. In this section we'll look at the equations we use to compute the partial derivative of J(Θ):

\frac{\partial}{\partial Θ_{i, j}^{(l)}} J (Θ)

To do so, we use the following algorithm:

Back propagation Algorithm

Given training set ${(x^{(1)}, y^{(1)}) \dots (x^{(m)}, y^{(m)})}$

Set $Δ_{i, j}^{(l)}$ := 0 for all (l,i,j), (hence you end up having a matrix full of zeros)

For training example t =1 to m:

Set $a^{(1)} := x^{(t)}$
Perform forward propagation to compute $a^{(l)}$ for l=2,3,…,L
Using $y^{(t)}$ , compute $δ^{(L)} = a^{(L)} - y^{(t)}$

Where L is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

Compute $δ^{(L - 1)}, δ^{(L - 2)}, \dots, δ^{(2)}$ using $δ^{(l)} = ((Θ^{(l)})^{T} δ^{(l + 1)}) . * a^{(l)} . * (1 - a^{(l)})$

The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by $z^{(l)}$ .

The g-prime derivative terms can also be written out as:

g^{'} (z^{(l)}) = a^{(l)} . * (1 - a^{(l)})

$Δ_{i, j}^{(l)} := Δ_{i, j}^{(l)} + a_{j}^{(l)} δ_{i}^{(l + 1)}$ or with vectorization, $Δ^{(l)} := Δ^{(l)} + δ^{(l + 1)} (a^{(l)})^{T}$

Hence we update our new $Δ$ matrix.

$D_{i, j}^{(l)} := \frac{1}{m} (Δ_{i, j}^{(l)} + λ Θ_{i, j}^{(l)})$ , if j≠0.
$D_{i, j}^{(l)} := \frac{1}{m} Δ_{i, j}^{(l)}$ If j=0.

The capital-delta matrix $Δ$ is used as an "accumulator" to add up our values as we go along, and D is the partial derivative computed from it. Thus we get $\frac{\partial}{\partial Θ_{i j}^{(l)}} J (Θ) = D_{i j}^{(l)}$
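
The steps above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the course's Octave solution: each Θ matrix is a list of rows with the bias weight in column 0, all units are sigmoid, and the function name `backprop` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(thetas, X, Y, lam):
    """Return D, a list of gradient matrices with the same shapes as thetas."""
    m = len(X)
    # Delta accumulators: one zero matrix per Theta^(l)
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(X, Y):
        # Forward pass; acts[l] is the input to thetas[l], bias unit prepended
        acts = []
        a = list(x)
        for theta in thetas:
            a = [1.0] + a
            acts.append(a)
            a = [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]
        # Output-layer error: delta^(L) = a^(L) - y
        delta = [a_k - y_k for a_k, y_k in zip(a, y)]
        # Walk the layers from right to left
        for l in range(len(thetas) - 1, -1, -1):
            theta, a_l = thetas[l], acts[l]
            # Accumulate delta^(l+1) * (a^(l))^T
            for i, d_i in enumerate(delta):
                for j, a_j in enumerate(a_l):
                    Delta[l][i][j] += d_i * a_j
            if l > 0:
                # (Theta^T delta) .* a .* (1 - a), dropping the bias entry j = 0
                delta = [
                    sum(theta[i][j] * delta[i] for i in range(len(delta)))
                    * a_l[j] * (1.0 - a_l[j])
                    for j in range(1, len(a_l))
                ]
    # D = (Delta + lambda * Theta) / m, leaving the bias column unregularized
    return [
        [[(Delta[l][i][j] + (lam * thetas[l][i][j] if j > 0 else 0.0)) / m
          for j in range(len(thetas[l][i]))]
         for i in range(len(thetas[l]))]
        for l in range(len(thetas))
    ]
```

The explicit loops mirror the element-wise formula for $Δ_{i, j}^{(l)}$ ; a vectorized version would replace the two inner loops with one outer product per layer.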

1.3 Backpropagation Intuition

Note: [4:39, the last term for the calculation for $z_{1}^{3}$ (three-color handwritten formula) should be $a_{2}^{2}$ instead of $a_{1}^{2}$ . 6:08 - the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be $(1 - y^{(i)}) \log (1 - h_{θ} (x^{(i)}))$ . 8:50 - $δ^{(4)} = y - a^{(4)}$ is incorrect and should be $δ^{(4)} = a^{(4)} - y$ .]

Recall that the cost function for a neural network is:

\begin{matrix} J (Θ) = - \frac{1}{m} \sum_{t = 1}^{m} \sum_{k = 1}^{K} [y_{k}^{(t)} \log ((h_{Θ} (x^{(t)}))_{k}) + (1 - y_{k}^{(t)}) \log (1 - (h_{Θ} (x^{(t)}))_{k})] + \frac{λ}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} (Θ_{j, i}^{(l)})^{2} \end{matrix}

If we consider simple non-multiclass classification (K = 1) and disregard regularization, the cost for a single example t is computed with:

\text{cost} (t) = - [y^{(t)} \log (h_{Θ} (x^{(t)})) + (1 - y^{(t)}) \log (1 - h_{Θ} (x^{(t)}))]

Intuitively, $δ_{j}^{(l)}$ is the "error" for $a_{j}^{(l)}$ (unit j in layer l). More formally, the delta values are actually the derivative of the cost function:

δ_{j}^{(l)} = \frac{\partial}{\partial z_{j}^{(l)}} \text{cost} (t)

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. Let us consider the following neural network below and see how we could calculate some $δ_{j}^{(l)}$ :

In the image above, to calculate $δ_{2}^{(2)}$ , we multiply the weights $Θ_{12}^{(2)}$ and $Θ_{22}^{(2)}$ by their respective $δ$ values found to the right of each edge. So we get $δ_{2}^{(2)}$ = $Θ_{12}^{(2)}$ * $δ_{1}^{(3)}$ + $Θ_{22}^{(2)}$ * $δ_{2}^{(3)}$ . To calculate every single possible $δ_{j}^{(l)}$ , we could start from the right of our diagram. We can think of our edges as our $Θ_{i j}$ . Going from right to left, to calculate the value of $δ_{j}^{(l)}$ , you can just take the overall sum of each weight times the $δ$ it is coming from. Hence, another example would be $δ_{2}^{(3)}$ = $Θ_{12}^{(3)}$ * $δ_{1}^{(4)}$ .
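
As a check on the formal definition of delta, the output-layer rule $δ^{(L)} = a^{(L)} - y$ can be recovered with the chain rule. The derivation below is a sketch that assumes a single sigmoid output unit (K = 1), writes a for $a^{(L)}$ and z for $z^{(L)}$ , and takes the per-example cost with the leading minus sign so that it matches the corresponding term of J(Θ):

```latex
\begin{aligned}
\text{cost} &= -\left[ y \log a + (1 - y) \log (1 - a) \right],
  \qquad a = g(z), \quad g'(z) = a (1 - a) \\
\frac{\partial\, \text{cost}}{\partial a}
  &= -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a (1 - a)} \\
\frac{\partial\, \text{cost}}{\partial z}
  &= \frac{\partial\, \text{cost}}{\partial a} \cdot g'(z)
   = \frac{a - y}{a (1 - a)} \cdot a (1 - a) = a - y
\end{aligned}
```

Applying the same chain-rule step one layer further back is what produces the weighted sum of the next layer's deltas, multiplied by $g^{'} (z^{(l)})$ .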

2 Backpropagation in Practice

2.1 Implementation Note: Unrolling Parameters

With neural networks, we are working with sets of matrices:

\begin{array}{r} Θ^{(1)}, Θ^{(2)}, Θ^{(3)}, \dots \\ D^{(1)}, D^{(2)}, D^{(3)}, \dots \end{array}

In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

If the dimensions of Theta1 are 10x11, Theta2 are 10x11 and Theta3 are 1x11, then we can get back our original matrices from the "unrolled" versions as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
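
The same round trip can be sketched in Python with nested lists. The helpers `unroll`, `reshape` and `roll` are hypothetical names for this illustration; they use column-major order on purpose, to mimic what Octave's `A(:)` and `reshape` do.

```python
def unroll(matrices):
    """Flatten a list of matrices into one vector, column by column like Octave's A(:)."""
    vec = []
    for mat in matrices:
        rows, cols = len(mat), len(mat[0])
        vec.extend(mat[i][j] for j in range(cols) for i in range(rows))
    return vec

def reshape(vec, start, rows, cols):
    """Rebuild a rows x cols matrix from vec[start : start + rows * cols] (column-major)."""
    return [[vec[start + j * rows + i] for j in range(cols)] for i in range(rows)]

def roll(vec, shapes):
    """Split an unrolled vector back into matrices with the given (rows, cols) shapes."""
    matrices, start = [], 0
    for rows, cols in shapes:
        matrices.append(reshape(vec, start, rows, cols))
        start += rows * cols
    return matrices

# Small example: a 2x2 matrix followed by a 1x2 matrix
theta_vector = unroll([[[1, 2], [3, 4]], [[5, 6]]])
restored = roll(theta_vector, [(2, 2), (1, 2)])
```

The only thing that matters is that unrolling and reshaping agree on the element order; a row-major convention would work equally well as long as both directions use it.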


2.2 Gradient Checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

\frac{\partial}{\partial Θ} J (Θ) \approx \frac{J (Θ + ϵ) - J (Θ - ϵ)}{2 ϵ}

With multiple theta matrices, we can approximate the derivative with respect to $Θ_{j}$ as follows:

\frac{\partial}{\partial Θ_{j}} J (Θ) \approx \frac{J (Θ_{1}, \dots, Θ_{j} + ϵ, \dots, Θ_{n}) - J (Θ_{1}, \dots, Θ_{j} - ϵ, \dots, Θ_{n})}{2 ϵ}

A small value for $ϵ$ (epsilon) such as $ϵ = 10^{- 4}$ usually gives a good approximation. If the value for ϵ is too small, we can end up with numerical (floating-point round-off) problems.

Hence, we only add or subtract epsilon to the single parameter $Θ_{j}$ , leaving all the others unchanged. In Octave we can do it as follows:

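epsilon = 1e-4;
gradApprox = zeros(size(theta));   % preallocate the result
for i = 1:n,
  % perturb only the i-th parameter, once in each direction
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  % the two-sided difference approximates the i-th partial derivative
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);
end;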

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.

Once you have verified once that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.
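
For comparison, here is the same check as a small Python sketch. The cost `toy_cost` and its hand-derived gradient `toy_gradient` are made-up stand-ins for J(Θ) and for the gradient that backpropagation would return; in a real check you would pass in the unrolled network cost instead.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided finite-difference approximation of the gradient of J at theta."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_plus[i] += epsilon     # perturb only parameter i upward
        theta_minus = list(theta)
        theta_minus[i] -= epsilon    # and downward
        grad.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad

def toy_cost(theta):
    # Hypothetical cost, used only to exercise the check
    return theta[0] ** 3 + 3.0 * theta[0] * theta[1]

def toy_gradient(theta):
    # Analytic gradient of toy_cost, playing the role of the backprop result
    return [3.0 * theta[0] ** 2 + 3.0 * theta[1], 3.0 * theta[0]]

point = [1.0, 2.0]
grad_approx = numerical_gradient(toy_cost, point)
grad_exact = toy_gradient(point)
largest_gap = max(abs(a - e) for a, e in zip(grad_approx, grad_exact))
```

If `largest_gap` is not tiny compared with the gradient entries themselves, the analytic gradient is the first suspect. Each call evaluates the cost twice per parameter, which is why the check is switched off during real training.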

2.3 Random Initialization

When running gradient descent or one of the advanced optimization algorithms, we need to pick initial values for the parameters theta. The advanced optimization routines assume you will pass in such an initial value:

optTheta = fminunc(@costFunction, initialTheta, options)

Is it possible to set the initial value of theta to the vector of all zeros? Whereas this worked okay when we were using logistic regression, initializing all of your parameters to zero does not work when you are training a neural network.

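A neural network is many functions acting together to form one nonlinear function that makes predictions. Each node in a hidden layer corresponds to a different function, that is, to a different set of parameters. For example, a hidden layer with 100 nodes computes 100 different functions, each with its own parameters.

Once all the initial parameters are identical, every hidden node computes the same function. The whole network shrinks from 100 nodes to effectively one node: in both forward and back propagation there is only a single function, and every update changes all the nodes in the same way.

Also note what the teacher says: gradient descent on a neural network often reaches a local optimum rather than the global optimum, which means the cost function is not a convex function. This is worth thinking about. The cost function of logistic regression and that of a neural network are basically the same, but the neural network's cost function nests several logistic regression functions inside it. The logistic regression cost is a convex function; the neural network cost is not.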
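
The symmetry argument can be checked numerically. The sketch below assumes a tiny network with 2 inputs, 2 sigmoid hidden units and 1 sigmoid output, trained on one made-up example; the function names, the example and the learning rate are all illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta1, theta2, x, y, alpha):
    """One gradient-descent step on a 2-input, 2-hidden-unit, 1-output network."""
    a1 = [1.0] + x
    a2 = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, a1))) for row in theta1]
    a3 = sigmoid(sum(w * v for w, v in zip(theta2, a2)))
    d3 = a3 - y                                                   # output error
    d2 = [theta2[j] * d3 * a2[j] * (1.0 - a2[j]) for j in (1, 2)]  # hidden errors
    new_theta1 = [[w - alpha * d2[i] * a1[j] for j, w in enumerate(row)]
                  for i, row in enumerate(theta1)]
    new_theta2 = [w - alpha * d3 * a2[j] for j, w in enumerate(theta2)]
    return new_theta1, new_theta2

def train(theta1, theta2, steps=50):
    for _ in range(steps):
        theta1, theta2 = gradient_step(theta1, theta2, [0.5, -1.0], 1.0, alpha=0.5)
    return theta1, theta2

# Identical starting weights: both hidden units receive identical updates
same1, same2 = train([[0.3, 0.3, 0.3], [0.3, 0.3, 0.3]], [0.3, 0.3, 0.3])
# Different starting weights: the hidden units can learn different functions
diff1, diff2 = train([[0.1, -0.2, 0.3], [-0.3, 0.2, 0.1]], [0.1, 0.4, -0.4])
```

After training, the two rows of `same1` should still be equal to each other, so the second hidden unit adds nothing, while the rows of `diff1` stay distinct.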

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize the weights of our Θ matrices.

We initialize each $Θ_{i j}^{(l)}$ to a random value in $[- ϵ, ϵ]$ . Scaling a uniform random number from [0, 1] by 2ϵ and then subtracting ϵ guarantees that we get the desired bound. The same procedure applies to all the $Θ$ 's. Below is some working code you could use to experiment.

Assume Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

rand(x,y) is just a function in Octave that returns an x-by-y matrix of random real numbers between 0 and 1.

(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
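
A Python sketch of the same initialization, using only the standard `random` module. The function name `rand_initialize` and the value 0.12 for `INIT_EPSILON` are assumptions made for illustration; the notes above do not fix a value.

```python
import random

INIT_EPSILON = 0.12  # illustrative value only, an assumption of this sketch

def rand_initialize(rows, cols, init_epsilon):
    """Return a rows x cols matrix with entries drawn uniformly from [-eps, eps)."""
    return [[random.random() * 2 * init_epsilon - init_epsilon for _ in range(cols)]
            for _ in range(rows)]

random.seed(0)  # fixed seed so that repeated runs give the same weights
Theta1 = rand_initialize(10, 11, INIT_EPSILON)
Theta2 = rand_initialize(10, 11, INIT_EPSILON)
Theta3 = rand_initialize(1, 11, INIT_EPSILON)
```

`random.random()` plays the role of Octave's `rand`: it returns a value in [0, 1), which is then stretched to a width of 2ϵ and shifted down by ϵ.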

2.4 Putting it Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

Number of input units = dimension of features $x^{(i)}$
Number of output units = number of classes
Number of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

Training a Neural Network

Randomly initialize the weights
Implement forward propagation to get $h_{Θ} (x^{(i)})$ for any $x^{(i)}$
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward and back propagation, we loop on every training example:

for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i),y(i))
   % (get activations a(l) and delta terms d(l) for l = 2,...,L)
end;

The following image gives us an intuition of what is happening as we are implementing our neural network:

And by the way, for neural networks this cost function J(Θ) is non-convex, so it can theoretically be susceptible to local minima. In fact, algorithms like gradient descent and the advanced optimization methods can, in theory, get stuck in local optima. It turns out that in practice this is not usually a huge problem: even though we can't guarantee that these algorithms will find a global optimum, gradient descent will usually do a very good job of minimizing J(Θ) and reach a very good local minimum, even if it doesn't get to the global optimum. Finally, gradient descent for a neural network might still seem a little bit magical, so let me show one more figure to try to give some intuition about what gradient descent for a neural network is doing.

This was actually similar to the figure that I was using earlier to explain gradient descent. So, we have some cost function, and we have a number of parameters in our neural network. Right here I've just written down two of the parameter values.

So what gradient descent does is we'll start from some random initial point like that one over there, and it will repeatedly go downhill.

And so what back propagation is doing is computing the direction of the gradient, and what gradient descent is doing is it's taking little steps downhill until hopefully it gets to, in this case, a pretty good local optimum.

So, when you implement back propagation and use gradient descent or one of the advanced optimization methods, this picture sort of explains what the algorithm is doing. It's trying to find a value of the parameters where the output values in the neural network closely matches the values of the y(i)'s observed in your training set. So, hopefully this gives you a better sense of how the many different pieces of neural network learning fit together.

Ideally, you want $h_{Θ} (x^{(i)}) \approx y^{(i)}$ . This will minimize our cost function. However, keep in mind that $J (Θ)$ is not convex and thus we can end up in a local minimum instead.
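
To see the pieces working together, here is a compact end-to-end sketch in Python: random initialization, forward propagation, the cost, backpropagation, and plain batch gradient descent. Everything specific in it is an assumption chosen to keep the example small: a single hidden layer with 2 units, the logical OR data set, no regularization, and the values of `alpha`, `iterations`, `seed` and `eps`.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    a1 = [1.0] + x
    a2 = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, a1))) for row in theta1]
    a3 = [sigmoid(sum(w * v for w, v in zip(row, a2))) for row in theta2]
    return a1, a2, a3

def cost(theta1, theta2, X, Y):
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(theta1, theta2, x)[2]
        total += sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for hk, yk in zip(h, y))
    return -total / len(X)

def gradients(theta1, theta2, X, Y):
    """Backpropagation over all examples; returns the averaged gradients D1, D2."""
    m = len(X)
    D1 = [[0.0] * len(row) for row in theta1]
    D2 = [[0.0] * len(row) for row in theta2]
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(theta1, theta2, x)
        d3 = [a - t for a, t in zip(a3, y)]
        d2 = [sum(theta2[k][j] * d3[k] for k in range(len(d3))) * a2[j] * (1 - a2[j])
              for j in range(1, len(a2))]
        for k in range(len(d3)):
            for j in range(len(a2)):
                D2[k][j] += d3[k] * a2[j] / m
        for i in range(len(d2)):
            for j in range(len(a1)):
                D1[i][j] += d2[i] * a1[j] / m
    return D1, D2

def train(X, Y, hidden=2, alpha=0.5, iterations=2000, seed=0, eps=0.5):
    rng = random.Random(seed)
    n_in, n_out = len(X[0]), len(Y[0])
    theta1 = [[rng.uniform(-eps, eps) for _ in range(n_in + 1)] for _ in range(hidden)]
    theta2 = [[rng.uniform(-eps, eps) for _ in range(hidden + 1)] for _ in range(n_out)]
    history = [cost(theta1, theta2, X, Y)]   # cost before any update
    for _ in range(iterations):
        D1, D2 = gradients(theta1, theta2, X, Y)
        theta1 = [[w - alpha * g for w, g in zip(r, gr)] for r, gr in zip(theta1, D1)]
        theta2 = [[w - alpha * g for w, g in zip(r, gr)] for r, gr in zip(theta2, D2)]
        history.append(cost(theta1, theta2, X, Y))
    return theta1, theta2, history

X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0.0], [1.0], [1.0], [1.0]]
theta1, theta2, history = train(X, Y)
```

Comparing `history[0]` with `history[-1]` shows how far the cost has dropped. Because J(Θ) is not convex, a different seed may settle at a different set of weights with a similar cost.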

3 Autonomous Driving

In this video, I'd like to show you a fun and historically important example of neural network learning: using a neural network for autonomous driving, that is, getting a car to learn to drive itself.

The video that I'll show in a minute was something that I got from Dean Pomerleau, a colleague who works at Carnegie Mellon University on the east coast of the United States. In part of the video you see visualizations like this, and I want to explain what the visualization looks like before starting the video.

Down here on the lower left is the view seen by the car of what's in front of it. And so here you kinda see a road that's maybe going a bit to the left, and then going a little bit to the right.

And up here on top, this first horizontal bar shows the direction selected by the human driver. The location of this bright white band shows the steering direction selected by the human driver, where far to the left corresponds to steering hard left and far to the right corresponds to steering hard right. So this location, a little bit left of center, means that the human driver at this point was steering slightly to the left. The second bar corresponds to the steering direction selected by the learning algorithm, and again the location of the white band means that the neural network was here selecting a steering direction that's slightly to the left.

In fact, before the neural network starts learning, you see that the network outputs a uniform grey band throughout this region. This uniform grey fuzz corresponds to the neural network having been randomly initialized and initially having no idea how to drive the car, or what direction to steer in. It is only after it has learned for a while that it starts to output a solid white band in just a small part of the region, corresponding to choosing a particular steering direction. That corresponds to the neural network becoming more confident in selecting a band in one particular location: rather than outputting a light grey fuzz, it outputs a white band that more consistently selects one steering direction.

Neural Network-Based Autonomous Driving. 1992 11 23

ALVINN is a system of artificial neural networks that learns to steer by watching a person drive. ALVINN is designed to control the NAVLAB 2, a modified Army Humvee equipped with sensors, computers, and actuators for autonomous navigation experiments.

The initial step in configuring ALVINN is training a network to steer. During training, a person drives the vehicle while ALVINN watches. Once every two seconds, ALVINN digitizes a video image of the road ahead, and records the person's steering direction.

This training image is reduced in resolution to 30 by 32 pixels and provided as input to ALVINN's three-layered network. Using the backpropagation learning algorithm, ALVINN is trained to output the same steering direction as the human driver for that image.

Initially the network steering response is random. After about two minutes of training the network learns to accurately imitate the steering reactions of the human driver. This same training procedure is repeated for other road types. After the networks have been trained the operator pushes the run switch and ALVINN begins driving.

Twelve times per second, ALVINN digitizes an image and feeds it to its neural networks. Each network, running in parallel, produces a steering direction and a measure of its confidence in its response.

The steering direction from the most confident network, in this case the network trained for the one-lane road, is used to control the vehicle.

Suddenly an intersection appears ahead of the vehicle. As the vehicle approaches the intersection, the confidence of the one-lane network decreases. As it crosses the intersection and the two-lane road ahead comes into view, the confidence of the two-lane network rises.

When its confidence rises, the two-lane network is selected to steer, safely guiding the vehicle into its lane on the two-lane road.

So that was autonomous driving using a neural network. Of course, there are more recent, more modern attempts at autonomous driving. There are a few projects in the US, Europe and elsewhere that are giving more robust driving controllers than this, but I think it's still pretty remarkable and pretty amazing how a simple neural network trained with backpropagation can actually learn to drive a car somewhat well.

posted @ 2021-10-26 20:09 Dba_sys
