[C6] Andrew Ng - Convolutional Neural Networks

About this Course

This course will teach you how to build convolutional neural networks and apply it to image data. Thanks to deep learning, computer vision is working far better than just two years ago, and this is enabling numerous exciting applications ranging from safe autonomous driving, to accurate face recognition, to automatic reading of radiology images.

You will:

  • Understand how to build a convolutional neural network, including recent variations such as residual networks.
  • Know how to apply convolutional networks to visual detection and recognition tasks.
  • Know to use neural style transfer to generate art.
  • Be able to apply these algorithms to a variety of image, video, and other 2D or 3D data.

This is the fourth course of the Deep Learning Specialization.

Foundations of Convolutional Neural Networks

Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

Computer Vision - 5m

Welcome to this course on Convolutional Networks. Computer vision is one of the areas that's been advancing rapidly thanks to deep learning. Deep learning computer vision is now helping self-driving cars figure out where the other cars and pedestrians around so as to avoid them. Is making face recognition work much better than ever before, so that perhaps some of you will soon, or perhaps already, be able to unlock a phone, unlock even a door using just your face. And if you look on your cell phone, I bet you have many apps that show you pictures of food, or pictures of a hotel, or just fun pictures of scenery. And some of the companies that build those apps are using deep learning to help show you the most attractive, the most beautiful, or the most relevant pictures. And I think deep learning is even enabling new types of art to be created. So, I think the two reasons I'm excited about deep learning for computer vision and why I think you might be too. First, rapid advances in computer vision are enabling brand new applications to view, though they just were impossible a few years ago. And by learning these tools, perhaps you will be able to invent some of these new products and applications. Second, even if you don't end up building computer vision systems per se, I found that because the computer vision research community has been so creatiy look at case studive and so inventive in coming up with new neural network architectures and algorithms, is actually inspire that creates a lot cross-fertilization into other areas as well. For example, when I was working on speech recognition, I sometimes actually took inspiration from ideas from computer vision and borrowed them into the speech literature. So, even if you don't end up working on computer vision, I hope that you find some of the ideas you learn about in this course hopeful for some of your algorithms and your architectures. So with that, let's get started. Here are some examples of computer vision problems we'll study in this course. You've already seen image classifications, sometimes also called image recognition, where you might take as input say a 64 by 64 image and try to figure out, is that a cat? Another example of the computer vision problem is object detection. So, if you're building a self-driving car, maybe you don't just need to figure out that there are other cars in this image. But instead, you need to figure out the position of the other cars in this picture, so that your car can avoid them. In object detection, usually, we have to not just figure out that these other objects say cars and picture, but also draw boxes around them. We have some other way of recognizing where in the picture are these objects. And notice also, in this example, that they can be multiple cars in the same picture, or at least every one of them within a certain distance of your car. Here's another example, maybe a more fun one is neural style transfer. Let's say you have a picture, and you want this picture repainted in a different style. So neural style transfer, you have a content image, and you have a style image. The image on the right is actually a Picasso. And you can have a neural network put them together to repaint the content image (that is the image on the left), but in the style of the image on the right, and you end up with the image at the bottom. So, algorithms like these are enabling new types of artwork to be created. And in this course, you'll learn how to do this yourself as well. One of the challenges of computer vision problems is that the inputs can get really big. For example, in previous courses, you've worked with 64 by 64 images. And so that's 64 by 64 by 3 because there are three color channels. And if you multiply that out, that's 12288. So x the input features has dimension 12288. And that's not too bad. But 64 by 64 is actually a very small image. If you work with larger images, maybe this is a 1000 pixel by 1000 pixel image, and that's actually just one megapixel. But the dimension of the input features will be 1000 by 1000 by 3, because you have three RGB channels, and that's three million. If you are viewing this on a smaller screen, this might not be apparent, but this is actually a low res 64 by 64 image, and this is a higher res 1000 by 1000 image. But if you have three million input features, then this means that X here will be three million dimensional. And so, if in the first hidden layer maybe you have just a 1000 hidden units, then the total number of weights that is the matrix W1, if you use a standard or fully connected network like we have in courses one or two. This matrix will be a 1000 by 3 million dimensional matrix. Because X is now R by three million. 3m. I'm using to denote three million. And this means that this matrix here will have three billion parameters which is just very, very large. And with that many parameters, it's difficult to get enough data to prevent a neural network from overfitting. And also, the computational requirements and the memory requirements to train a neural network with three billion parameters is just a bit infeasible. But for computer vision applications, you don't want to be stuck using only tiny little images. You want to use large images. To do that, you need to better implement the convolution operation, which is one of the fundamental building blocks of convolutional neural networks. Let's see what this means, and how you can implement this, in the next video. And we'll illustrate convolutions, using the example of Edge Detection.

Edge Detection Example - 11m

The convolution operation is one of the fundamental building blocks of a convolutional neural network. Using edge detection as the motivating example in this video, you will see how the convolution operation works. In previous videos, I have talked about how the early layers of the neural network might detect edges and then the some later layers might detect cause of objects and then even later layers may detect cause of complete objects like people's faces in this case. In this video, you see how you can detect edges in an image. Lets take an example. Given a picture like that for a computer to figure out what are the objects in this picture, the first thing you might do is maybe detect vertical edges in this image. For example, this image has all those vertical lines, where the buildings are, as well as kind of vertical lines idea all lines of these pedestrians and so those get detected in this vertical edge detector output. And you might also want to detect horizontal edges so for example, there is a very strong horizontal line where this railing is and that also gets detected sort of roughly here. How do you detect edges in image like this? Let us look with an example. Here is a 6 by 6 grayscale image and because this is a grayscale image, this is just a 6 by 6 by 1 matrix rather than 6 by 6 by 3 because they are on a separate rgb channels. In order to detect edges or lets say vertical edges in his image, what you can do is construct a 3 by 3 matrix and in the pollens when the terminology of convolutional neural networks, this is going to be called a filter. And I am going to construct a 3 by 3 filter or 3 by 3 matrix that looks like this 1, 1, 1, 0, 0, 0, -1, -1, -1. Sometimes research papers will call this a kernel instead of a filter but I am going to use the filter terminology in these videos. And what you are going to do is take the 6 by 6 image and convolve it and the convolution operation is denoted by this asterisk and convolve it with the 3 by 3 filter. One slightly unfortunate thing about the notation is that in mathematics, the asterisk is the standard symbol for convolution but in Python, this is also used to denote multiplication or maybe element wise multiplication. This asterisk has dual purposes is overloaded notation but I will try to be clear in these videos when this asterisk refers to convolution. The output of this convolution operator will be a 4 by 4 matrix, which you can interpret, which you can think of as a 4 by 4 image. The way you compute this 4 by 4 output is as follows, to compute the first elements, the upper left element of this 4 by 4 matrix, what you are going to do is take the 3 by 3 filter and paste it on top of the 3 by 3 region of your original input image. I have written here 1, 1, 1, 0, 0, 0, -1, -1, -1. And what you should do is take the element wise product so the first one would be three times 1 and then the second one would be one times one I'm going down here, one times one and then plus two times one, just one and then add up all of the resulting nine numbers. So then the middle column gives you zero times zero, plus five times zero, plus seven times zero and then the right most column gives one times -1, eight times -1, plus two times -1. Adding up these nine numbers will give you negative 5 and so I'm going to fill in negative 5 over here. You can add up these nine numbers in any order of course. It is just that I went down the first column, then second column, then the third. Next, to figure out what is this second element, you are going to take the blue square and shift it one step to the right like so. Let me get rid of the green marks here. You are going to do the same element wise product and then addition. You have zero times one, plus five times one, plus seven times one, plus one time zero, plus eight times zero, plus two times zero, plus two times negative 1, plus nine times negative one, plus five times negative one and if you add up those nine numbers, you end up with negative four and so on. If you shift this to the right, do the nine products and add them up, you get zero and then over here you should get 8. Just to verify, you have 2 plus 9 plus 5 that's 16. Then the middle column gives you zero and then the right most column 4 plus 1 plus three times negative 1, that's -8 so that is 16 on the left column -8 and that gives you 8 like we have over here. Next, in order to get you this element in the next row what you do is take the blue square and now shift it one down so you now have it in that position, and again repeat the element wise products and then adding exercise. If you do that, you should get negative 10 here. If you shift it one to the right, you should get negative 2 and then 2 and then 3 and so on. Then fill in all the rest of the elements of the matrix. To be clearer, this -16 would be obtained by from this lower right 3 by 3 region. A 6 by 6 matrix convolve of the 3 by 3 matrix gives you a 4 by 4 matrix. And these are images and filters. These are really just matrices of various dimensions. But the matrix on the left is convenient to interpret as image, and the one in the middle we interpret as a filter and the one on the right, you can interpret that as maybe another image. And this turns out to be a vertical edge detector, and you see why on the next slide. Before going on though, just one other comment, which is that if you implement this in a programming language, then in practice, most foreign languages will have some different functions rather than an asterisk to denote convolution. For example, in the previous exercise, you use or you implement a function called conv-forward. If you do this in tens of flow, there is a function tf.nn.cont2d. And then other deep learning programming frameworks in the CARIS program firmware, we shall see later in this course, there is a function called cont2d that implements convolution and so on. But all the deep learning frameworks that have a good support for computer vision will have some functions for implementing this convolution operator. Why is this doing vertical edge detection? Lets look at another example. To illustrate this, we are going to use a simplified image. Here is a simple 6 by 6 image where the left half of the image is 10 and the right half is zero. If you plot this as a picture, it might look like this, where the left half, the 10s, give you brighter pixel intensive values and the right half gives you darker pixel intensive values. I am using that shade of gray to denote zeros, although maybe it could also be drawn as black. But in this image, there is clearly a very strong vertical edge right down the middle of this image as it transitions from white to black or white to darker color. When you convolve this with the 3 by 3 filter and so this 3 by 3 filter can be visualized as follows, where is lighter, brighter pixels on the left and then this mid tone zeroes in the middle and then darker on the right. What you get is this matrix on the right. Just to verify this math if you want, this zero for example, is obtained by taking the element wise products and then multiplying with this 3 by 3 block and so you get from the left column 10 plus 10 plus 10 and then zeroes in the middle and then -10, -10, -10 which is why you end up with zero over here. Whereas in contrast, if that 30 will be obtained from this, which you get from having 10 plus 10 plus 10 and then minus zero, minus zero which is why you end up with a 30 over there. Now, if you plot this right most matrix's image it will look like that where there is this lighter region right in the middle and that corresponds to this having detected this vertical edge down the middle of your 6 by 6 image. In case the dimensions here seem a little bit wrong that the detected edge seems really thick, that's only because we are working with very small images in this example. And if you are using, say a 1000 by 1000 image rather than a 6 by 6 image then you find that this does a pretty good job, really detecting the vertical edges in your image. In this example, this bright region in the middle is just the output images way of saying that it looks like there is a strong vertical edge right down the middle of the image. Maybe one intuition to take away from vertical edge detection is that a vertical edge is a three by three region since we are using a 3 by 3 filter where there are bright pixels on the left, you do not care that much what is in the middle and dark pixels on the right. The middle in this 6 by 6 image is really where there could be bright pixels on the left and darker pixels on the right and that is why it thinks its a vertical edge over there. The convolution operation gives you a convenient way to specify how to find these vertical edges in an image. You have now seen how the convolution operator works. In the next video, you will see how to take this and use it as one of the basic building blocks of a Convolution Neural Network.

More Edge Detection - 7m

0:02
You've seen how the convolution operation allows you to implement a vertical edge detector. In this video, you'll learn the difference between positive and negative edges, that is, the difference between light to dark versus dark to light edge transitions. And you'll also see other types of edge detectors, as well as how to have an algorithm learn, rather than have us hand code an edge detector as we've been doing so far. So let's get started.
0:31
Here's the example you saw from the previous video, where you have this image, six by six, there's light on the left and dark on the right, and convolving it with the vertical edge detection filter results in detecting the vertical edge down the middle of the image.
0:47
What happens in an image where the colors are flipped, where it is darker on the left and brighter on the right? So the 10s are now on the right half of the image and the 0s on the left. If you convolve it with the same edge detection filter, you end up with negative 30s, instead of 30 down the middle, and you can plot that as a picture that maybe looks like that. So because the shade of the transitions is reversed, the 30s now gets reversed as well. And the negative 30s shows that this is a dark to light rather than a light to dark transition. And if you don't care which of these two cases it is, you could take absolute values of this output matrix. But this particular filter does make a difference between the light to dark versus the dark to light edges.
1:42
Let's see some more examples of edge detection. This three by three filter we've seen allows you to detect vertical edges. So maybe it should not surprise you too much that this three by three filter will allow you to detect horizontal edges. So as a reminder, a vertical edge according to this filter, is a three by three region where the pixels are relatively bright on the left part and relatively dark on the right part. So similarly, a horizontal edge would be a three by three region where the pixels are relatively bright on top and relatively dark in the bottom row. So here's one example, this is a more complex one, where you have here 10s in the upper left and lower right-hand corners. So if you draw this as an image, this would be an image which is going to be darker where there are 0s, so I'm going to shade in the darker regions, and then lighter in the upper left and lower right-hand corners. And if you convolve this with a horizontal edge detector, you end up with this.
2:48
And so just to take a couple of examples, this 30 here corresponds to this three by three region, where indeed there are bright pixels on top and darker pixels on the bottom. It's kind of over here. And so it finds a strong positive edge there. And this -30 here corresponds to this region, which is actually brighter on the bottom and darker on top. So that is a negative edge in this example. And again, this is kind of an artifact of the fact that we're working with relatively small images, that this is just a six by six image. But these intermediate values, like this -10, for example, just reflects the fact that that filter here, it captures part of the positive edge on the left and part of the negative edge on the right, and so blending those together gives you some intermediate value. But if this was a very large, say a thousand by a thousand image with this type of checkerboard pattern, then you won't see these transitions regions of the 10s. The intermediate values would be quite small relative to the size of the image. So in summary, different filters allow you to find vertical and horizontal edges. It turns out that the three by three vertical edge detection filter we've used is just one possible choice. And historically, in the computer vision literature, there was a fair amount of debate about what is the best set of numbers to use. So here's something else you could use, which is maybe 1, 2, 1, 0, 0, 0, -1, -2, -1. This is called a Sobel filter. And the advantage of this is it puts a little bit more weight to the central row, the central pixel, and this makes it maybe a little bit more robust. But computer vision researchers will use other sets of numbers as well, like maybe instead of a 1, 2, 1, it should be a 3, 10, 3, right? And then -3, -10, -3. And this is called a Scharr filter. And this has yet other slightly different properties. And this is just for vertical edge detection. And if you flip it 90 degrees, you get horizontal edge detection. And with the rise of deep learning, one of the things we learned is that when you really want to detect edges in some complicated image, maybe you don't need to have computer vision researchers handpick these nine numbers. Maybe you can just learn them and treat the nine numbers of this matrix as parameters, which you can then learn using back propagation. And the goal is to learn nine parameters so that when you take the image, the six by six image, and convolve it with your three by three filter, that this gives you a good edge detector.
5:50
And what you see in later videos is that by just treating these nine numbers as parameters, the backprop can choose to learn 1, 1, 1, 0, 0, 0, -1,-1, if it wants, or learn the Sobel filter or learn the Scharr filter, or more likely learn something else that's even better at capturing the statistics of your data than any of these hand coded filters. And rather than just vertical and horizontal edges, maybe it can learn to detect edges that are at 45 degrees or 70 degrees or 73 degrees or at whatever orientation it chooses. And so by just letting all of these numbers be parameters and learning them automatically from data, we find that neural networks can actually learn low level features, can learn features such as edges, even more robustly than computer vision researchers are generally able to code up these things by hand. But underlying all these computations is still this convolution operation, Which allows back propagation to learn whatever three by three filter it wants and then to apply it throughout the entire image, at this position, at this position, at this position, in order to output whatever feature it's trying to detect. Be it vertical edges, horizontal edges, or edges at some other angle or even some other filter that we might not even have a name for in English.
7:19
So the idea you can treat these nine numbers as parameters to be learned has been one of the most powerful ideas in computer vision. And later in this course, later this week, we'll actually talk about the details of how you actually go about using back propagation to learn these nine numbers. But first, let's talk about some other details, some other variations, on the basic convolution operation. In the next two videos, I want to discuss with you how to use padding as well as different strides for convolutions. And these two will become important pieces of this convolutional building block of convolutional neural networks. So let's go on to the next video.

Padding - 9m

0:01
In order to build deep neural networks one modification to the basic convolutional operation that you need to really use is padding. Let's see how it works. What we saw in earlier videos is that if you take a six by six image and convolve it with a three by three filter, you end up with a four by four output with a four by four matrix, and that's because the number of possible positions with the three by three filter, there are only, sort of, four by four possible positions, for the three by three filter to fit in your six by six matrix. And the math of this this turns out to be that if you have a end by end image and to involved that with an f by f filter, then the dimension of the output will be; n minus f plus one by n minus f plus one. And in this example, six minus three plus one is equal to four, which is why you wound up with a four by four output. So the two downsides to this; one is that, if every time you apply a convolutional operator, your image shrinks, so you come from six by six down to four by four then, you can only do this a few times before your image starts getting really small, maybe it shrinks down to one by one or something, so maybe, you don't want your image to shrink every time you detect edges or to set other features on it, so that's one downside, and the second downside is that, if you look the pixel at the corner or the edge, this little pixel is touched as used only in one of the outputs, because this touches that three by three region. Whereas, if you take a pixel in the middle, say this pixel, then there are a lot of three by three regions that overlap that pixel and so, is as if pixels on the corners or on the edges are use much less in the output. So you're throwing away a lot of the information near the edge of the image. So, to solve both of these problems, both the shrinking output, and when you build really deep neural networks, you see why you don't want the image to shrink on every step because if you have, maybe a hundred layer of deep net, then it'll shrinks a bit on every layer, then after a hundred layers you end up with a very small image. So that was one problem, the other is throwing away a lot of the information from the edges of the image. So in order to fix both of these problems, what you can do is the full apply of convolutional operation. You can pad the image. So in this case, let's say you pad the image with an additional one border, with the additional border of one pixel all around the edges. So, if you do that, then instead of a six by six image, you've now padded this to eight by eight image and if you convolve an eight by eight image with a three by three image you now get that out. Now, the four by four by the six by six image, so you managed to preserve the original input size of six by six. So by convention when you pad, you padded with zeros and if p is the padding amounts. So in this case, p is equal to one, because we're padding all around with an extra boarder of one pixels, then the output becomes n plus 2p minus f plus one by n plus 2p minus f by one. So, this becomes six plus two times one minus three plus one by the same thing on that. So, six plus two minus three plus one that's equals to six. So you end up with a six by six image that preserves the size of the original image. So this being pixel actually influences all of these cells of the output and so this effective, maybe not by throwing away but counting less the information from the edge of the corner or the edge of the image is reduced. And I've shown here, the effect of padding deep border with just one pixel. If you want, you can also pad the border with two pixels, in which case I guess, you do add on another border here and they can pad it with even more pixels if you choose. So, I guess what I'm drawing here, this would be a padded equals to p plus two. In terms of how much to pad, it turns out there two common choices that are called, Valid convolutions and Same convolutions. Not really is a great names but in a valid convolution, this basically means no padding. And so in this case you might have n by n image convolve with an f by f filter and this would give you an n minus f plus one by n minus f plus one dimensional output. So this is like the example we had previously on the previous videos where we had an n by n image convolve with the three by three filter and that gave you a four by four output. The other most common choice of padding is called the same convolution and that means when you pad, so the output size is the same as the input size. So if we actually look at this formula, when you pad by p pixels then, its as if n goes to n plus 2p and then you have from the rest of this, right? Minus f plus one. So we have an n by n image and the padding of a border of p pixels all around, then the output sizes of this dimension is xn plus 2p minus f plus one. And so, if you want n plus 2p minus f plus one to be equal to one, so the output size is same as input size, if you take this and solve for, I guess, n cancels out on both sides and if you solve for p, this implies that p is equal to f minus one over two. So when f is odd, by choosing the padding size to be as follows, you can make sure that the output size is same as the input size and that's why, for example, when the filter was three by three as this had happened in the previous slide, the padding that would make the output size the same as the input size was three minus one over two, which is one. And as another example, if your filter was five by five, so if f is equal to five, then, if you pad it into that equation you find that the padding of two is required to keep the output size the same as the input size when the filter is five by five. And by convention in computer vision, f is usually odd. It's actually almost always odd and you rarely see even numbered filters, filter works using computer vision. And I think that two reasons for that; one is that if f was even, then you need some asymmetric padding. So only if f is odd that this type of same convolution gives a natural padding region, had the same dimension all around rather than pad more on the left and pad less on the right, or something that asymmetric. And then second, when you have an odd dimension filter, such as three by three or five by five, then it has a central position and sometimes in computer vision its nice to have a distinguisher, it's nice to have a pixel, you can call the central pixel so you can talk about the position of the filter. Right, maybe none of this is a great reason for using f to be pretty much always odd but if you look a convolutional literature you see three by three filters are very common. You see some five by five, seven by sevens. And actually sometimes, later we'll also talk about one by one filters and that why that makes sense. But just by convention, I recommend you just use odd number filters as well. I think that you can probably get just fine performance even if you want to use an even number value for f, but if you stick to the common computer vision convention, I usually just use odd number f. So you've now seen how to use padded convolutions. To specify the padding for your convolution operation, you can either specify the value for p or you can just say that this is a valid convolution, which means p equals zero or you can say this is a same convolution, which means pad as much as you need to make sure the output has same dimension as the input. So that's it for padding. In the next video, let's talk about how you can implement Strided convolutions.

Strided Convolutions - 9m

0:00
Strided convolutions is another piece of the basic building block of convolutions as used in Convolutional Neural Networks. Let me show you an example. Let's say you want to convolve this seven by seven image with this three by three filter, except that instead of doing the usual way, we are going to do it with a stride of two. What that means is you take the element Y's product as usual in this upper left three by three region and then multiply and add and that gives you 91. But then instead of stepping the blue box over by one step, we are going to step over by two steps. So, we are going to make it hop over two steps like so. Notice how the upper left hand corner has gone from this start to this start, jumping over one position. And then you do the usual element Y's product and summing it turns out 100. And now we are going to do they do that again, and make the blue box jump over by two steps. You end up there, and that gives you 83. Now, when you go to the next row, you again actually take two steps instead of one step so going to move the blue box over there. Notice how we are stepping over one of the positions and then this gives you 69, and now you again step over two steps, this gives you 91 and so on so 127. And then for the final row 44, 72, and 74. In this example, we convolve with a seven by seven matrix to this three by three matrix and we get a three by three outputs. The input and output dimensions turns out to be governed by the following formula, if you have an N by N image, they convolve with an F by F filter. And if you use padding P and stride S. In this example, S is equal to two then you end up with an output that is N plus two P minus F, and now because you're stepping S steps of the time, you step just one step of the time, you now divide by S plus one and then can apply the same thing. In our example, we have seven plus zero, minus three, divided by two S stride plus one equals let's see, that's four over two plus one equals three, which is why we wound up with this is three by three output. Now, just one last detail which is what of this fraction is not an integer? In that case, we're going to round this down so this notation denotes the flow of something. This is also called the flow of Z. It means taking Z and rounding down to the nearest integer. The way this is implemented is that you take this type of blue box multiplication only if the blue box is fully contained within the image or the image plus to the padding and if any of this blue box kind of part of it hangs outside and you just do not do that computation. Then it turns out that if that's the convention that your three by three filter, must lie entirely within your image or the image plus the padding region before there's as a corresponding output generated that's convention. Then the right thing to do to compute the output dimension is to round down in case this N plus two P minus F over S is not an integer. Just to summarize the dimensions, if you have an N by N matrix or N by N image that you convolve with an F by F matrix or F by F filter with padding P N stride S, then the output size will have this dimension. It is nice we can choose all of these numbers so that there is an integer although sometimes you don't have to do that and rounding down is just fine as well. But please feel free to work through a few examples of values of N, F, P and S on yourself to convince yourself if you want, that this formula is correct for the output size. Now, before moving on there is a technical comment I want to make about cross-correlation versus convolutions and just for the facts what you have to do to implement convolutional neural networks. If you reading different math textbook or signal processing textbook, there is one other possible inconsistency in the notation which is that, if you look at the typical math textbook, the way that the convolution is defined before doing the element Y's product and summing, there's actually one other step that you'll first take which is to convolve this six by six matrix with this three by three filter. You at first take the three by three filter and slip it on the horizontal as well as the vertical axis so this 345102 minus 197, will become, three goes here, four goes there, five goes there and then the second row becomes this,102 minus 197. Well, this is really taking the three by three filter and narrowing it both on the vertical and horizontal axes. And then it was this flit matrix that you would then copy over here. To compute the output, you will take two times seven, plus three times two, plus seven times five and so on. I should multiply out the elements of this flit matrix in order to compute the upper left hand rows elements of the four by four output as follows. Then you take those nine numbers and shift them over by one shift them over by one and so on. The way we've define the convolution operation in this video is that we've skipped this narrowing operation. Technically, what we're actually doing, the operation we've been using for the last few videos is sometimes cross-correlation instead of convolution. But in the deep learning literature by convention, we just call this a convolutional operation. Just to summarize, by convention in machine learning, we usually do not bother with this skipping operation and technically, this operation is maybe better called cross-correlation but most of the deep learning literature just calls it the convolution operator. And so I'm going to use that convention in these videos as well, and if you read a lot of the machines learning literature, you'll find most people just call this the convolution operator without bothering to use these slips. It turns out that in signal processing or in certain branches of mathematics, doing the flipping in the definition of convolution causes convolution operator to enjoy this property that A convolve with B, convolve with C is equal to A convolve with B, convolve with C, and this is called associativity in mathematics. This is nice for some signal processing applications but for deep neural networks it really doesn't matter and so omitting this double mirroring operation just simplifies the code and makes the neural networks work just as well. And by convention, most of us just call this convolution or even though the mathematicians prefer to call this cross-correlation sometimes. But this should not affect anything you have to implement in the problem exercises and should not affect your ability to read and understand the deep learning literature. You've now seen how to carry out convolutions and you've seen how to use padding as well as strides to convolutions. But so far, all we've been using is convolutions over matrices, like over a six by six matrix. In the next video, you'll see how to carry out convolutions over volumes and this would make what you can do a convolutions sounds really much more powerful. Let's go on to the next video.

Convolutions Over Volume - 10m

0:01
You've seen how convolutions over 2D images works. Now, let's see how you can implement convolutions over, not just 2D images, but over three dimensional volumes. Let's start with an example, let's say you want to detect features, not just in a great scale image, but in a RGB image. So, an RGB image might be instead of a six by six image, it could be six by six by three, where the three here responds to the three color channels. So, you think of this as a stack of three six by six images. In order to detect edges or some other feature in this image, you can vault this, not with a three by three filter, as we have previously, but now with also with a 3D filter, that's going to be three by three by three. So the filter itself will also have three layers corresponding to the red, green, and blue channels. So to give these things some names, this first six here, that's the height of the image, that's the width, and this three is the number of channels. And your filter also similarly has a height, a width, and the number of channels. And the number of channels in your image must match the number of channels in your filter, so these two numbers have to be equal. We'll see on the next slide how this convolution operation actually works, but the output of this will be a four by four image. And notice this is four by four by one, there's no longer a three at the end. Let's go through in detail how this works but let's use a more nicely drawn image. So here's the six by six by three image, and here's a three by three by three filter, and this last number, the number of channels matches the 3D image and the filter. So to simplify the drawing of this three by three by three filter, instead of joining it is a stack of the matrices, I'm also going to, sometimes, just draw it as this three dimensional cube, like that. So to compute the output of this convolutional operation, what you would do is take the three by three by three filter and first, place it in that upper left most position. So, notice that this three by three by three filter has 27 numbers, or 27 parameters, that's three cubes. And so, what you do is take each of these 27 numbers and multiply them with the corresponding numbers from the red, green, and blue channels of the image, so take the first nine numbers from red channel, then the three beneath it to the green channel, then the three beneath it to the blue channel, and multiply it with the corresponding 27 numbers that gets covered by this yellow cube show on the left. Then add up all those numbers and this gives you this first number in the output, and then to compute the next output you take this cube and slide it over by one, and again, due to 27 multiplications, add up the 27 numbers, that gives you this next output, do it for the next number over, for the next position over, that gives the third output and so on. That dives you the forth and then one row down and then the next one, to the next one, to the next one, and so on, you get the idea, until at the very end, that's the position you'll have for that final output. So, what does this allow you to do? Well, here's an example, this filter is three by three by three. So, if you want to detect edges in the red channel of the image, then you could have the first filter, the one, one, one, one is one, one is one, one is one as usual, and have the green channel be all zeros, and have the blue filter be all zeros. And if you have these three stock together to form your three by three by three filter, then this would be a filter that detect edges, vertical edges but only in the red channel. Alternatively, if you don't care what color the vertical edge is in, then you might have a filter that's like this, whereas this one, one, one, minus one, minus one, minus one, in all three channels. So, by setting this second alternative, set the parameters, you then have a edge detector, a three by three by three edge detector, that detects edges in any color. And with different choices of these parameters you can get different feature detectors out of this three by three by three filter. And by convention, in computer vision, when you have an input with a certain height, a certain width, and a certain number of channels, then your filter will have a potential different height, different width, but the same number of channels. And in theory it's possible to have a filter that maybe only looks at the red channel or maybe a filter looks at only the green channel and a blue channel. And once again, you notice th\t convolving a volume, a six by six by three convolve with a three by three by three, that gives a four by four, a 2D output. Now that you know how to convolve on volumes, there is one last idea that will be crucial for building convolutional neural networks, which is what if we don't just wanted to detect vertical edges? What if we wanted to detect vertical edges and horizontal edges and maybe 45 degree edges and maybe 70 degree edges as well, but in other words, what if you want to use multiple filters at the same time? So, here's the picture we had from the previous slide, we had six by six by three convolved with the three by three by three, gets four by four, and maybe this is a vertical edge detector, or maybe it's run to detect some other feature. Now, maybe a second filter may be denoted by this orange-ish color, which could be a horizontal edge detector. So, maybe convolving it with the first filter gives you this first four by four output and convolving with the second filter gives you a different four by four output. And what we can do is then take these two four by four outputs, take this first one within the front and you can take this second filter output and well, let me draw it here, put it at back as follows, so that by stacking these two together, you end up with a four by four by two output volume, right? And you can think of the volume as if we draw this is a box, I guess it would look like this. So this would be a four by four by two output volume, which is the result of taking your six by six by three image and convolving it or applying two different three by three filters to it, resulting in two four by four outputs that then gets stacked up to form a four by four by two volume. And the two here comes from the fact that we used two different filters. So, let's just summarize the dimensions, if you have a n by n by number of channels input image, so an example, there's a six by six by three, where n subscript C is the number of channels, and you convolve that with a f by f by, and again, this should be the same nC, so this was, three by three by three, and by convention this and this have to be the same number. Then, what you get is n minus f plus one by n minus f plus one by and you want to use this nC prime, or its really nC of the next layer, but this is the number of filters that you use. So this in our example would be be four by four by two. And I wrote this assuming that you use a stride of one and no padding. But if you used a different stride of padding than this n minus F plus one would be affected in a usual way, as we see in the previous videos. So this idea of convolution on volumes, turns out to be really powerful. Only a small part of it is that you can now operate directly on RGB images with three channels. But even more important is that you can now detect two features, like vertical, horizontal edges, or 10, or maybe a 128, or maybe several hundreds of different features. And the output will then have a number of channels equal to the number of filters you are detecting. And as a note of notation, I've been using your number of channels to denote this last dimension in the literature, people will also often call this the depth of this 3D volume and both notations, channels or depth, are commonly used in the literature. But they find depth more confusing because you usually talk about the depth of the neural network as well, so I'm going to use the term channels in these videos to refer to the size of this third dimension of these filters. So now that you know how to implement convolutions over volumes, you now are ready to implement one layer of the convolutional neural network. Let's see how to do that in the next video.

One Layer of a Convolutional Network - 16m

0:03
Get now ready to see how to build one layer of a convolutional neural network, let's go through the example.
0:12
You've seen at the previous video how to take a 3D volume and convolve it with say two different filters.
0:21
In order to get in this example to different 4 by 4 outputs.
0:30
So let's say convolving with the first filter gives this first 4 by 4 output, and convolving with this second filter gives a different 4 by 4 output. The final thing to turn this into a convolutional neural net layer, is that for each of these we're going to add it bias, so this is going to be a real number. And where python broadcasting, you kind of have to add the same number so every one of these 16 elements. And then apply a non-linearity which for this illustration that says relative non-linearity, and this gives you a 4 by 4 output, all right? After applying the bias and the non-linearity. And then for this thing at the bottom as well, you add some different bias, again, this is a real number. So you add the single number to all 16 numbers, and then apply some non-linearity, let's say a real non-linearity. And this gives you a different 4 by 4 output. Then same as we did before, if we take this and stack it up as follows, so we ends up with a 4 by 4 by 2 outputs. Then this computation where you come from a 6 by 6 by 3 to 4 by 4 by 4, this is one layer of a convolutional neural network. So to map this back to one layer of four propagation in the standard neural network, in a non-convolutional neural network. Remember that one step before the prop was something like this, right? z1 = w1 times a0, a 0 was also equal to x, and then plus b[1]. And you apply the non-linearity to get a[1], so that's g(z[1]). So this input here, in this analogy this is a[0], this is x3.
2:44
And these filters here, this plays a role similar to w1. And you remember during the convolution operation, you were taking these 27 numbers, or really well, 27 times 2, because you have two filters. You're taking all of these numbers and multiplying them. So you're really computing a linear function to get this 4 x 4 matrix. So that 4 x 4 matrix, the output of the convolution operation, that plays a rolesimilar to w1 times a0. That's really maybe the output of this 4 x 4 as well as that 4 x 4. And then the other thing you do is add the bias. So, this thing here before applying value, this plays a role similar to z. And then it's finally by applying the non-linearity, this kind of this I guess. So, this output plays a role, this really becomes your activation at the next layer. So this is how you go from a0 to a1, as far as tthe linear operation and then convolution has all these multipled. So the convolution is really applying a linear operation and you have the biases and the applied value operation. And you've gone from a 6 by 6 by 3, dimensional a0, through one layer of neural network to, I guess a 4 by 4 by 2 dimensional a(1). And so 6 by 6 by 3 has gone to 4 by 4 by 2, and so that is one layer of convolutional net.
4:33
Now in this example we have two filters, so we had two features of you will, which is why we wound up with our output 4 by 4 by 2. But if for example we instead had 10 filters instead of 2, then we would have wound up with the 4 by 4 by 10 dimensional output volume. Because we'll be taking 10 of these naps not just two of them, and stacking them up to form a 4 by 4 by 10 output volume, and that's what a1 would be. So, to make sure you understand this, let's go through an exercise. Let's suppose you have 10 filters, not just two filters, that are 3 by 3 by 3 and 1 layer of a neural network, how many parameters does this layer have?
5:21
Well, let's figure this out. Each filter, is a 3 x 3 x 3 volume, so 3 x 3 x 3, so each fill has 27 parameters, all right? There's 27 numbers to be run, and plus the bias.
5:42
So that was the b parameter, so this gives you 28 parameters.
5:50
And then if you imagine that on the previous slide we had drawn two filters, but now if you imagine that you actually have ten of these, right? 1, 2..., 10 of these, then all together you'll have 28 times 10, so that will be 280 parameters. Notice one nice thing about this, is that no matter how big the input image is, the input image could be 1,000 by 1,000 or 5,000 by 5,000, but the number of parameters you have still remains fixed as 280. And you can use these ten filters to detect features, vertical edges, horizontal edges maybe other features anywhere even in a very, very large image is just a very small number of parameters.
6:40
So these is really one property of convolution neural network that makes less prone to overfitting then if you could. So once you've learned 10 feature detectors that work, you could apply this even to large images. And the number of parameters still is fixed and relatively small, as 280 in this example. All right, so to wrap up this video let's just summarize the notation we are going to use to describe one layer to describe a covolutional layer in a convolutional neural network. So layer l is a convolution layer, l am going to use f superscript,[l] to denote the filter size. So previously we've been seeing the filters are f by f, and now this superscript square bracket l just denotes that this is a filter size of f by f filter layer l. And as usual the superscript square bracket l is the notation we're using to refer to particular layer l.
7:39
going to use p[l] to denote the amount of padding. And again, the amount of padding can also be specified just by saying that you want a valid convolution, which means no padding, or a same convolution which means you choose the padding. So that the output size has the same height and width as the input size.
7:59
And then you're going to use s[l] to denote the stride.
8:03
Now, the input to this layer is going to be some dimension. It's going be some n by n by number of channels in the previous layer. Now, I'm going to modify this notation a little bit. I'm going to us superscript l- 1, because that's the activation from the previous layer, l- 1 times nc of l- 1. And in the example so far, we've been just using images of the same height and width. That in case the height and width might differ, l am going to use superscript h and superscript w, to denote the height and width of the input of the previous layer, all right? So in layer l, the size of the volume will be nh by nw by nc with superscript squared bracket l. It's just in layer l, the input to this layer Is whatever you had for the previous layer, so that's why you have l- 1 there. And then this layer of the neural network will itself output the value. So that will be nh of l by nw of l, by nc of l, that will be the size of the output. And so whereas we approve this set that the output volume size or at least the height and weight is given by this formula, n + 2p- f over s + 1, and then take the full of that and round it down. In this new notation what we have is that the outputs value that's in layer l, is going to be the dimension from the previous layer, plus the padding we're using in this layer l, minus the filter size we're using this layer l and so on. And technically this is true for the height, right? So the height of the output volume is given by this, and you can compute it with this formula on the right, and the same is true for the width as well. So you cross out h and throw in w as well, then the same formula with either the height or the width plugged in for computing the height or width of the output value.
10:36
So that's how nhl -1 relates to nhl and wl- 1 relates to nwl. Now, how about the number of channels, where did those numbers come from? Let's take a look, if the output volume has this depth, while we know from the previous examples that that's equal to the number of filters we have in that layer, right? So we had two filters, the output value was 4 by 4 by 2, was 2 dimensional. And if you had 10 filters and your upper volume was 4 by 4 by 10. So, this the number of channels in the output value, that's just the number of filters we're using in this layer of the neural network. Next, how about the size of this filter? Well, each filter is going to be fl by fl by 100 number, right? So what is this last number? Well, we saw that you needed to convolve a 6 by 6 by 3 image, with a 3 by 3 by 3 filter.
11:43
And so the number of channels in your filter, must match the number of channels in your input, so this number should match that number, right? Which is why each filter is going to be f(l) by f(l) by nc(l-1). And the output of this layer often apply devices in non-linearity, is going to be the activations of this layer al. And that we've already seen will be this dimension, right? The al will be a 3D volume, that's nHl by nwl by ncl. And when you are using a vectorized implementation or batch gradient descent or mini batch gradient descent, then you actually outputs Al, which is a set of m activations, if you have m examples. So that would be M by nHl, by nwl by ncl right? If say you're using bash grading decent and in the programming sizes this will be ordering of the variables. And we have the index and the trailing examples first, and then these three variables. Next how about the weights or the parameters, or kind of the w parameter? Well we saw already what the filter dimension is. So the filters are going to be f[l] by f[l] by nc [l- 1], but that's the dimension of one filter. How many filters do we have? Well, this is a total number of filters, so the weights really all of the filters put together will have dimension given by this, times the total number of filters, right? Because this, Last quantity is the number of filters, In layer l.
13:45
And then finally, you have the bias parameters, and you have one bias parameter, one real number for each filter. So you're going to have, the bias will have this many variables, it's just a vector of this dimension. Although later on we'll see that the code will be more convenient represented as 1 by 1 by 1 by nc[l] four dimensional matrix, or four dimensional tensor.
14:16
So I know that was a lot of notation, and this is the convention I'll use for the most part. I just want to mention in case you search online and look at open source code. There isn't a completely universal standard convention about the ordering of height, width, and channel. So If you look on source code on GitHub or these open source implementations, you'll find that some authors use this order instead, where you first put the channel first, and you sometimes see that ordering of the variables. And in fact in some common frameworks, actually in multiple common frameworks, there's actually a variable or a parameter. Why do you want to list the number of channels first, or list the number of channels last when indexing into these volumes. I think both of these conventions work okay, so long as you're consistent. And unfortunately maybe this is one piece of annotation where there isn't consensus in the deep learning literature but i'm going to use this convention for these videos.
15:24
Where we list height and width and then the number of channels last. So I know there was certainly a lot of new notations you could use, but you're thinking wow, that's a long notation, how do I need to remember all of these? Don't worry about it, you don't need to remember all of this notation, and through this week's exercises you become more familiar with it at that time. But the key point I hope you take a way from this video, is just one layer of how convolutional neural network works. And the computations involved in taking the activations of one layer and mapping that to the activations of the next layer. And next, now that you know how one layer of the compositional neural network works, let's stack a bunch of these together to actually form a deeper compositional neural network. Let's go on to the next video to see,

Simple Convolutional Network Example - 8m

0:00
In the last video, you saw the building blocks of a single layer, of a single convolution layer in the ConvNet. Now let's go through a concrete example of a deep convolutional neural network. And this will give you some practice with the notation that we introduced toward the end of the last video as well.
0:19
Let's say you have an image, and you want to do image classification, or image recognition. Where you want to take as input an image, x, and decide is this a cat or not, 0 or 1, so it's a classification problem. Let's build an example of a ConvNet you could use for this task. For the sake of this example, I'm going to use a fairly small image. Let's say this image is 39 x 39 x 3. This choice just makes some of the numbers work out a bit better. And so, nH in layer 0 will be equal to nw height and width are equal to 39 and the number of channels and layer 0 is equal to 3. Let's say the first layer uses a set of 3 by 3 filters to detect features, so f = 3 or really f1 = 3, because we're using a 3 by 3 process. And let's say we're using a stride of 1, and no padding. So using a same convolution, and let's say you have 10 filters.
1:34
Then the activations in this next layer of the neutral network will be 37 x 37 x 10, and this 10 comes from the fact that you use 10 filters. And 37 comes from this formula n + 2p- f over s + 1. Right, then I guess you have 39 + 0- 3 over 1 + 1 that's = to 37. So that's why the output is 37 by 37, it's a valid convolution and that's the output size. So in our notation you would have nh[1] = nw[1] = 37 and nc[1] = 10, so nc[1] is also equal to the number of filters from the first layer. And so this becomes the dimension of the activation at the first layer.
2:43
Let's say you now have another convolutional layer and let's say this time you use 5 by 5 filters. So, in our notation f[2] at the next neural network = 5, and let's say use a stride of 2 this time. And maybe you have no padding and say, 20 filters.
3:09
So then the output of this will be another volume, this time it will be 17 x 17 x 20. Notice that, because you're now using a stride of 2, the dimension has shrunk much faster. 37 x 37 has gone down in size by slightly more than a factor of 2, to 17 x 17. And because you're using 20 filters, the number of channels now is 20. So it's this activation a2 would be that dimension and so nh[2] = nw[2] = 17 and nc[2] = 20. All right, let's apply one last convolutional layer. So let's say that you use a 5 by 5 filter again, and again, a stride of 2. So if you do that, I'll skip the math, but you end up with a 7 x 7, and let's say you use 40 filters, no padding, 40 filters. You end up with 7 x 7 x 40. So now what you've done is taken your 39 x 39 x 3 input
4:29
image and computed your 7 x 7 x 40 features for this image. And then finally, what's commonly done is if you take this 7 x 7 x 40, 7 times 7 times 40 is actually 1,960. And so what we can do is take this volume and flatten it or unroll it into just 1,960 units, right? Just flatten it out into a vector, and then feed this to a logistic regression unit, or a softmax unit.
5:07
Depending on whether you're trying to recognize or trying to recognize any one of key different objects and then just have this give the final predicted output for the neural network.
5:20
So just be clear, this last step is just taking all of these numbers, all 1,960 numbers, and unrolling them into a very long vector. So then you just have one long vector that you can feed into softmax until it's just a regression in order to make prediction for the final output.
5:41
So this would be a pretty typical example of a ConvNet.
5:47
A lot of the work in designing convolutional neural net is selecting hyperparameters like these, deciding what's the total size? What's the stride? What's the padding and how many filters are used?
6:00
And both later this week as well as next week, we'll give some suggestions and some guidelines on how to make these choices. But for now, maybe one thing to take away from this is that as you go deeper in a neural network, typically you start off with larger images, 39 by 39. And then the height and width will stay the same for a while and gradually trend down as you go deeper in the neural network. It's gone from 39 to 37 to 17 to 14. Excuse me, it's gone from 39 to 37 to 17 to 7. Whereas the number of channels will generally increase. It's gone from 3 to 10 to 20 to 40, and you see this general trend in a lot of other convolutional neural networks as well.
6:47
So we'll get more guidelines about how to design these parameters in later videos. But you've now seen your first example of a convolutional neural network, or a ConvNet for short. So congratulations on that.
7:02
And it turns out that in a typical ConvNet, there are usually three types of layers. One is the convolutional layer, and often we'll denote that as a Conv layer. And that's what we've been using in the previous network. It turns out that there are two other common types of layers that you haven't seen yet but we'll talk about in the next couple of videos. One is called a pooling layer, often I'll call this pool. And then the last is a fully connected layer called FC. And although it's possible to design a pretty good neural network using just convolutional layers, most neural network architectures will also have a few pooling layers and a few fully connected layers.
7:46
Fortunately pooling layers and fully connected layers are a bit simpler than convolutional layers to define.
7:54
So we'll do that quickly in the next two videos and then you have a sense of all of the most common types of layers in a convolutional neural network. And you will put together even more powerful networks than the one we just saw.
8:08
So congrats again on seeing your first full convolutional neural network. We'll also talk later in this week about how to train these networks, but first let's talk briefly about pooling and fully connected layers. And then training these, we'll be using back propagation, which you're already familiar with. But in the next video, let's quickly go over how to implement a pooling layer for your ConvNet.

Pooling Layers - 10m

0:00
Other than convolutional layers, ConvNets often also use pooling layers to reduce the size of the representation, to speed the computation, as well as make some of the features that detects a bit more robust. Let's take a look. Let's go through an example of pooling, and then we'll talk about why you might want to do this. Suppose you have a four by four input, and you want to apply a type of pooling called max pooling. And the output of this particular implementation of max pooling will be a two by two output. And the way you do that is quite simple. Take your four by four input and break it into different regions and I'm going to color the four regions as follows. And then, in the output, which is two by two, each of the outputs will just be the max from the corresponding reshaded region. So the upper left, I guess, the max of these four numbers is nine. On upper right, the max of the blue numbers is two. Lower left, the biggest number is six, and lower right, the biggest number is three. So to compute each of the numbers on the right, we took the max over a two by two regions. So, this is as if you apply a filter size of two because you're taking a two by two regions and you're taking a stride of two. So, these are actually the hyperparameters of max pooling because we start from this filter size. It's like a two by two region that gives you the nine. And then, you step all over two steps to look at this region, to give you the two, and then for the next row, you step it down two steps to give you the six, and then step to the right by two steps to give you three. So because the squares are two by two, f is equal to two, and because you stride by two, s is equal to two. So here's the intuition behind what max pooling is doing. If you think of this four by four region as some set of features, the activations in some layer of the neural network, then a large number, it means that it's maybe detected a particular feature. So, the upper left-hand quadrant has this particular feature. It maybe a vertical edge or maybe a higher or whisker if you trying to detect a [inaudible]. Clearly, that feature exists in the upper left-hand quadrant. Whereas this feature, maybe it isn't cat eye detector. Whereas this feature, it doesn't really exist in the upper right-hand quadrant. So what the max operation does is a lots of features detected anywhere, and one of these quadrants , it then remains preserved in the output of max pooling. So, what the max operates to does is really to say, if these features detected anywhere in this filter, then keep a high number. But if this feature is not detected, so maybe this feature doesn't exist in the upper right-hand quadrant. Then the max of all those numbers is still itself quite small. So maybe that's the intuition behind max pooling. But I have to admit, I think the main reason people use max pooling is because it's been found in a lot of experiments to work well, and the intuition I just described, despite it being often cited, I don't know of anyone fully knows if that is the real underlying reason. I don't have anyone knows if that's the real underlying reason that max pooling works well in ConvNets. One interesting property of max pooling is that it has a set of hyperparameters but it has no parameters to learn. There's actually nothing for gradient descent to learn. Once you fix f and s, it's just a fixed computation and gradient descent doesn't change anything. Let's go through an example with some different hyperparameters. Here, I am going to use, sure you have a five by five input and we're going to apply max pooling with a filter size that's three by three. So f is equal to three and let's use a stride of one. So in this case, the output size is going to be three by three. And the formulas we had developed in the previous videos for figuring out the output size for conv layer, those formulas also work for max pooling. So, that's n plus 2p minus f over s plus 1. That formula also works for figuring out the output size of max pooling. But in this example, let's compute each of the elements of this three by three output. The upper left-hand elements, we're going to look over that region. So notice this is a three by three region because the filter size is three and to the max there. So, that will be nine, and then we shifted over by one because which you can stride at one. So, that max in the blue box is nine. Let's shift that over again. The max of the blue box is five. And then let's go on to the next row, a stride of one. So we're just stepping down by one step. So max in that region is nine, max in that region is nine, max in that region, it's now with a two fives, we have maxes of five. And then finally, max in that is eight. Max in that is six, and max in that, this is not [inaudible]. Okay, so this, with this set of hyperparameters f equals three, s equals one gives that output shown [inaudible]. Now, so far, I've shown max pooling on a 2D inputs. If you have a 3D input, then the outputs will have the same dimension. So for example, if you have five by five by two, then the output will be three by three by two and the way you compute max pooling is you perform the computation we just described on each of the channels independently. So the first channel which is shown here on top is still the same, and then for the second channel, I guess, this one that I just drew at the bottom, you would do the same computation on that slice of this value and that gives you the second slice. And more generally, if this was five by five by some number of channels, the output would be three by three by that same number of channels. And the max pooling computation is done independently on each of these N_C channels. So, that's max pooling. This one is the type of pooling that isn't used very often, but I'll mention briefly which is average pooling. So it does pretty much what you'd expect which is, instead of taking the maxes within each filter, you take the average. So in this example, the average of the numbers in purple is 3.75, then there is 1.25, and four and two. And so, this is average pooling with hyperparameters f equals two, s equals two, we can choose other hyperparameters as well. So these days, max pooling is used much more often than average pooling with one exception, which is sometimes very deep in a neural network. You might use average pooling to collapse your representation from say, 7 by 7 by 1,000. An average over all the [inaudible] , you get 1 by 1 by 1,000. We'll see an example of this later. But you see, max pooling used much more in the neural network than average pooling. So just to summarize, the hyperparameters for pooling are f, the filter size and s, the stride, and maybe common choices of parameters might be f equals two, s equals two. This is used quite often and this has the effect of roughly shrinking the height and width by a factor of above two, and a common chosen hyperparameters might be f equals two, s equals two, and this has the effect of shrinking the height and width of the representation by a factor of two. I've also seen f equals three, s equals two used, and then the other hyperparameter is just like a binary bit that says, are you using max pooling or are you using average pooling. If you want, you can add an extra hyperparameter for the padding although this is very, very rarely used. When you do max pooling, usually, you do not use any padding, although there is one exception that we'll see next week as well. But for the most parts of max pooling, usually, it does not use any padding. So, the most common value of p by far is p equals zero. And the input of max pooling is that you input a volume of size that, N_H by N_W by N_C, and it would output a volume of size given by this. So assuming there's no padding by N_W minus f over s, this one for by N_C. So the number of input channels is equal to the number of output channels because pooling applies to each of your channels independently. One thing to note about pooling is that there are no parameters to learn. So, when we implement that crop, you find that there are no parameters that backdrop will adapt through max pooling. Instead, there are just these hyperparameters that you set once, maybe set ones by hand or set using cross-validation. And then beyond that, you are done. Its just a fixed function that the neural network computes in one of the layers, and there is actually nothing to learn. It's just a fixed function. So, that's it for pooling. You now know how to build convolutional layers and pooling layers. In the next video, let's see a more complex example of a ConvNet. One that will also allow us to introduce fully connected layers.

CNN Example - 12m

0:00
You now know pretty much all the building blocks of building a full convolutional neural network. Let's look at an example. Let's say you're inputting an image which is 32 x 32 x 3, so it's an RGB image and maybe you're trying to do handwritten digit recognition. So you have a number like 7 in a 32 x 32 RGB initiate trying to recognize which one of the 10 digits from zero to nine is this. Let's throw the neural network to do this. And what I'm going to use in this slide is inspired, it's actually quite similar to one of the classic neural networks called LeNet-5, which is created by Yann LeCun many years ago. What I'll show here isn't exactly LeNet-5 but it's inspired by it, but many parameter choices were inspired by it. So with a 32 x 32 x 3 input let's say that the first layer uses a 5 x 5 filter and a stride of 1, and no padding. So the output of this layer, if you use 6 filters would be 28 x 28 x 6, and we're going to call this layer conv 1. So you apply 6 filters, add a bias, apply the non-linearity, maybe a real non-linearity, and that's the conv 1 output. Next, let's apply a pooling layer, so I am going to apply mass pooling here and let's use a f=2, s=2. When I don't write a padding use a pad easy with a 0. Next let's apply a pooling layer, I am going to apply, let's see max pooling with a 2 x 2 filter and the stride equals 2. So this is should reduce the height and width of the representation by a factor of 2. So 28 x 28 now becomes 14 x 14, and the number of channels remains the same so 14 x 14 x 6, and we're going to call this the Pool 1 output. So, it turns out that in the literature of a ConvNet there are two conventions which are inside the inconsistent about what you call a layer. One convention is that this is called one layer. So this will be layer one of the neural network, and now the conversion will be to call they convey layer as a layer and the pool layer as a layer. When people report the number of layers in a neural network usually people just record the number of layers that have weight, that have parameters. And because the pooling layer has no weights, has no parameters, only a few hyper parameters, I'm going to use a convention that Conv 1 and Pool 1 shared together. I'm going to treat that as Layer 1, although sometimes you see people maybe read articles online and read research papers, you hear about the conv layer and the pooling layer as if they are two separate layers. But this is maybe two slightly inconsistent notation terminologies, but when I count layers, I'm just going to count layers that have weights. So achieve both of this together as Layer 1. And the name Conv1 and Pool1 use here the 1 at the end also refers the fact that I view both of this is part of Layer 1 of the neural network. And Pool 1 is grouped into Layer 1 because it doesn't have its own weights. Next, given a 14 x 14 bx 6 volume, let's apply another convolutional layer to it, let's use a filter size that's 5 x 5, and let's use a stride of 1, and let's use 10 filters this time. So now you end up with, A 10 x 10 x 10 volume, so I'll call this Comv 2, and then in this network let's do max pulling with f=2, s=2 again. So you could probably guess the output of this, f=2, s=2, this should reduce the height and width by a factor of 2, so you're left with 5 x 5 x 10. And so I'm going to call this Pool 2, and in our convention this is Layer 2 of the neural network. Now let's apply another convolutional layer to this. I'm going to use a 5 x 5 filter, so f = 5, and let's try this, 1, and I don't write the padding, means there's no padding. And this will give you the Conv 2 output, and that's your 16 filters. So this would be a 10 x 10 x 16 dimensional output. So we look at that, and this is the Conv 2 layer. And then let's apply max pooling to this with f=2, s=2. You can probably guess the output of this, we're at 10 x 10 x 16 with max pooling with f=2, s=2. This will half the height and width, you can probably guess the result of this, right? Left pooling with f = 2, s = 2. This should halve the height and width so you end up with a 5 x 5 x 16 volume, same number of channels as before. We're going to call this Pool 2. And in our convention this is Layer 2 because this has one set of weights and your Conv 2 layer. Now 5 x 5 x 16, 5 x 5 x 16 is equal to 400. So let's now fatten our Pool 2 into a 400 x 1 dimensional vector. So think of this as fatting this up into these set of neurons, like so. And what we're going to do is then take these 400 units and let's build the next layer, As having 120 units. So this is actually our first fully connected layer. I'm going to call this FC3 because we have 400 units densely connected to 120 units.
6:46
So this fully connected unit, this fully connected layer is just like the single neural network layer that you saw in Courses 1 and 2. This is just a standard neural network where you have a weight matrix that's called W3 of dimension 120 x 400. And this is fully connected because each of the 400 units here is connected to each of the 120 units here, and you also have the bias parameter, yes that's going to be just a 120 dimensional, this is 120 outputs. And then lastly let's take 120 units and add another layer, this time smaller but let's say we had 84 units here, I'm going to call this fully connected Layer 4. And finally we now have 84 real numbers that you can fit to a [INAUDIBLE] unit. And if you're trying to do handwritten digital recognition, to recognize this hand it is 0, 1, 2, and so on up to 9. Then this would be a softmax with 10 outputs. So this is a vis-a-vis typical example of what a convolutional neural network might look like. And I know this seems like there a lot of hyper parameters. We'll give you some more specific suggestions later for how to choose these types of hyper parameters. Maybe one common guideline is to actually not try to invent your own settings of hyper parameters, but to look in the literature to see what hyper parameters you work for others. And to just choose an architecture that has worked well for someone else, and there's a chance that will work for your application as well. We'll see more about that next week. But for now I'll just point out that as you go deeper in the neural network, usually nh and nw to height and width will decrease. Pointed this out earlier, but it goes from 32 x 32, to 20 x 20, to 14 x 14, to 10 x 10, to 5 x 5. So as you go deeper usually the height and width will decrease, whereas the number of channels will increase. It's gone from 3 to 6 to 16, and then your fully connected layer is at the end. And another pretty common pattern you see in neural networks is to have conv layers, maybe one or more conv layers followed by a pooling layer, and then one or more conv layers followed by pooling layer. And then at the end you have a few fully connected layers and then followed by maybe a softmax. And this is another pretty common pattern you see in neural networks. So let's just go through for this neural network some more details of what are the activation shape, the activation size, and the number of parameters in this network. So the input was 32 x 30 x 3, and if you multiply out those numbers you should get 3,072. So the activation, a0 has dimension 3072. Well it's really 32 x 32 x 3. And there are no parameters I guess at the input layer. And as you look at the different layers, feel free to work out the details yourself. These are the activation shape and the activation sizes of these different layers.
10:15
So just to point out a few things. First, notice that the max pooling layers don't have any parameters. Second, notice that the conv layers tend to have relatively few parameters, as we discussed in early videos. And in fact, a lot of the parameters tend to be in the fully collected layers of the neural network. And then you notice also that the activation size tends to maybe go down gradually as you go deeper in the neural network. If it drops too quickly, that's usually not great for performance as well. So it starts first there with 6,000 and 1,600, and then slowly falls into 84 until finally you have your Softmax output. You find that a lot of will have properties will have patterns similar to these. So you've now seen the basic building blocks of neural networks, your convolutional neural networks, the conv layer, the pooling layer, and the fully connected layer. A lot of computer division research has gone into figuring out how to put together these basic building blocks to build effective neural networks. And putting these things together actually requires quite a bit of insight. I think that one of the best ways for you to gain intuition is about how to put these things together is a C a number of concrete examples of how others have done it. So what I want to do next week is show you a few concrete examples even beyond this first one that you just saw on how people have successfully put these things together to build very effective neural networks. And through those videos next week l hope you hold your own intuitions about how these things are built. And as we are given concrete examples that architectures that maybe you can just use here exactly as developed by someone else or your own application. So we'll do that next week, but before wrapping this week's videos just one last thing which is one I'll talk a little bit in the next video about why you might want to use convolutions. Some benefits and advantages of using convolutions as well as how to put them all together. How to take a neural network like the one you just saw and actually train it on a training set to perform image recognition for some of the tasks. So with that let's go on to the last video of this week.

Why Convolutions? - 9m

0:00
For this final video for this week, let's talk a bit about why convolutions are so useful when you include them in your neural networks. And then finally, let's briefly talk about how to put this all together and how you could train a convolution neural network when you have a label training set. I think there are two main advantages of convolutional layers over just using fully connected layers. And the advantages are parameter sharing and sparsity of connections. Let me illustrate with an example. Let's say you have a 32 by 32 by 3 dimensional image, and this actually comes from the example from the previous video, but let's say you use five by five filter with six filters. And so, this gives you a 28 by 28 by 6 dimensional output. So, 32 by 32 by 3 is 3,072, and 28 by 28 by 6 if you multiply all those numbers is 4,704. And so, if you were to create a neural network with 3,072 units in one layer, and with 4,704 units in the next layer, and if you were to connect every one of these neurons, then the weight matrix, the number of parameters in a weight matrix would be 3,072 times 4,704 which is about 14 million. So, that's just a lot of parameters to train. And today you can train neural networks with even more parameters than 14 million, but considering that this is just a pretty small image, this is a lot of parameters to train. And of course, if this were to be 1,000 by 1,000 image, then your display matrix will just become invisibly large. But if you look at the number of parameters in this convolutional layer, each filter is five by five. So, each filter has 25 parameters, plus a bias parameter miss of 26 parameters per a filter, and you have six filters, so, the total number of parameters is that, which is equal to 156 parameters. And so, the number of parameters in this conv layer remains quite small. And the reason that a consonant has run to these small parameters is really two reasons. One is parameter sharing. And parameter sharing is motivated by the observation that feature detector such as vertical edge detector, that's useful in one part of the image is probably useful in another part of the image. And what that means is that, if you've figured out say a three by three filter for detecting vertical edges, you can then apply the same three by three filter over here, and then the next position over, and the next position over, and so on. And so, each of these feature detectors, each of these aqua's can use the same parameters in lots of different positions in your input image in order to detect say a vertical edge or some other feature. And I think this is true for low-level features like edges, as well as the higher level features, like maybe, detecting the eye that indicates a face or a cat or something there. But being with a share in this case the same nine parameters to compute all 16 of these aquas, is one of the ways the number of parameters is reduced. And it also just seems intuitive that a feature detector like a vertical edge detector computes it for the upper left-hand corner of the image. The same feature seems like it will probably be useful, has a good chance of being useful for the lower right-hand corner of the image. So, maybe you don't need to learn separate feature detectors for the upper left and the lower right-hand corners of the image. And maybe you do have a dataset where you have the upper left-hand corner and lower right-hand corner have different distributions, so, they maybe look a little bit different but they might be similar enough, they're sharing feature detectors all across the image, works just fine. The second way that consonants get away with having relatively few parameters is by having sparse connections. So, here's what I mean, if you look at the zero, this is computed via three by three convolution. And so, it depends only on this three by three inputs grid or cells. So, it is as if this output units on the right is connected only to nine out of these six by six, 36 input features. And in particular, the rest of these pixel values, all of these pixel values do not have any effects on the other output. So, that's what I mean by sparsity of connections. As another example, this output depends only on these nine input features. And so, it's as if only those nine input features are connected to this output, and the other pixels just don't affect this output at all. And so, through these two mechanisms, a neural network has a lot fewer parameters which allows it to be trained with smaller training cells and is less prone to be over 30. And so, sometimes you also hear about convolutional neural networks being very good at capturing translation invariance. And that's the observation that a picture of a cat shifted a couple of pixels to the right, is still pretty clearly a cat. And convolutional structure helps the neural network encode the fact that an image shifted a few pixels should result in pretty similar features and should probably be assigned the same oval label. And the fact that you are applying to same filter, knows all the positions of the image, both in the early layers and in the late layers that helps a neural network automatically learn to be more robust or to better capture the desirable property of translation invariance. So, these are maybe a couple of the reasons why convolutions or convolutional neural network work so well in computer vision. Finally, let's put it all together and see how you can train one of these networks. Let's say you want to build a cat detector and you have a labeled training sets as follows, where now, X is an image. And the y's can be binary labels, or one of K causes. And let's say you've chosen a convolutional neural network structure, may be inserted the image and then having neural convolutional and pulling layers and then some fully connected layers followed by a software output that then operates Y hat. The conv layers and the fully connected layers will have various parameters, W, as well as bias's B. And so, any setting of the parameters, therefore, lets you define a cost function similar to what we have seen in the previous courses, where we've randomly initialized parameters W and B. You can compute the cause J, as the sum of losses of the neural networks predictions on your entire training set, maybe divide it by M. So, to train this neural network, all you need to do is then use gradient descents or some of the algorithm like, gradient descent momentum, or RMSProp or Adam, or something else, in order to optimize all the parameters of the neural network to try to reduce the cost function J. And you find that if you do this, you can build a very effective cat detector or some other detector. So, congratulations on finishing this week's videos. You've now seen all the basic building blocks of a convolutional neural network, and how to put them together into an effective image recognition system. In this week's program exercises, I think all of these things will come more concrete, and you'll get the chance to practice implementing these things yourself and seeing it work for yourself. Next week, we'll continue to go deeper into convolutional neural networks. I mentioned earlier, that there're just a lot of the hyperparameters in convolution neural networks. So, what I want to do next week, is show you a few concrete examples of some of the most effective convolutional neural networks, so you can start to recognize the patterns of what types of network architectures are effective. And one thing that people often do is just take the architecture that someone else has found and published in a research paper and just use that for your application. And so, by seeing some more concrete examples next week, you also learn how to do that better. And beyond that, next week, we'll also just get that intuitions about what makes confinet work well, and then in the rest of the course, we'll also see a variety of other computer vision applications such as, object detection, and neural store transfer. How they create new forms of artwork using these set of algorithms. So, that's over this week, best of luck with the home works, and I look forward to seeing you next week.

(Optional) Heroes of Deep Learning - Yann LeCun Interview - 27m

0:03
Hi Yann, you've been such a leader for Deep Learning for so long, thanks a lot for doing this with us. >> Well, thanks for having me. >> So, you've been working on neural nets for a long time. I would love to hear your personal story of how you got started in AI, how did you networking with neural networks? >> So, I was always interested in intelligence, in general, the origins of intelligence in humans. Got me interested into human evolution when I was a kid. >> That was in France? >> It was in France, yeah. I was in middle school or something and I was interested in technology, space, etc. My favorite movie was 2001: A Space Odyssey. You had intelligent machines, space travel, and human evolution as kind of the themes that was what I was fascinated by. And the concept of intelligent machines I think really kind of appealed to me. And then I studied electrical engineering. And when I was at school, I was maybe in second year of engineering school, I stumbled on a book, which was actually a philosophy book. It was a debate between Noam Chomsky, the computational linguist at MIT, and Jean Piaget who is a cognitive psychologist sort of psychology of child development in Switzerland. And it was basically a debate between nature and nurture, where Chomsky arguing for the fact that language has a lot of innate structure, and Piaget saying a lot of it is learned, and etc. And on the side of Piaget was a transcription of a person who, each of these guys sort of brought their teams of people to argue for their side. And on the side of Piaget was Seymour Papert from MIT, who had worked on the perceptron model, one of the first machines capable of running. And I never heard of the perceptron, and I read this article that say, machine capable of running, that sounds wonderful. And so I started going to several university libraries and searching for everything I could find that talked about the perceptron and realized there was a lot of papers from the 50s, but it kind of stopped at the end of the 60s, with a book that was co-authored by the same Seymour Papert. >> What year was this? >> So this was in 1980, roughly? >> Right. >> And so I did a couple of projects with some of the math professor in my school on kind of neural nets, essentially. But there was no one there I could talk to who had worked on this, because the field basically had disappeared in the meantime, right? Since 1980, nobody was working on this.
2:42
And experimented with this a little bit, writing kind of simulation software of various kinds, reading about neuroscience.
2:52
When I finished my engineering studies, I studied chip design. I'm good at site design at the time, so it's something completely different. And when I finished I really wanted to do research on this and I figured out already that at the time the important question was how you train neural nets with multiple layers. It was pretty clear in the literature of the 60s that that was the important question that had been left unsolved and their idea of hierarchy and everything. I'd read Fukushima's article on the neocognitron, right? Which was this sort of hierarchical architecture very similar to now what we now call convolutional nets, but without really backprop style learning algorithms. And I met people who were in a small independent club in France. They were interested in what they called at the time, Automata Networks. And they gave me a couple papers, the people on functional networks which is not very popular anymore. But it's the first associative memories with neural net and that paper can revive the interest of some research committees into neural net in the early 80s. Where by mostly physicists and condense matter physicists and a few psychologists, it was still not okay for engineers and computer scientists to talk about neural nets. And they also should be another paper that had just been distributed as a pre-print, whose title was Optimal Perceptual Inference. And this was the first paper on Boltzmann machines by Geoff Hinton and Terry Sejnowski. It was talking about hidden units. It was talking about, basically, the part of learning, multilayer neural nets are more capable than just classifiers. So I said, I need to meet these people [LAUGH]. >> Wow. >> Because they're only interested in the right problem.
4:54
And a couple of years later, after I started my PhD, I participated in a workshop in Le Juch that was organized by the people I was working with. And Terry was one of the speakers at the workshop, so I met him at that time. >> It was like early 80s now. >> This is 1985, early 1985. So I met Terry Sejnowski in 1985 in the workshop in France in Le Juch and a lot of people were there, founders of early neural net, jump up field and, a lot of people working on theoretical neuroscience and stuff like that. It was a fascinating workshop. I met also, a couple of people from Bell Labs who eventually hired me at Bell Labs, but this was several years before I finished my PhD. So I talked to Terry Sejnowski and I was telling him about what I was working on which was some version of backprop at the time. This is before backprop was a paper and Terry was working on net talk at the time.
5:55
This was before the Rumelhart, Hinton, Williams paper on backprop had been published. But he was friends with Geoff, this information was circulating, so he was already working on trying to make this work for net talk, but he didn't tell me. >> I see. >> And he went back to US and told Geoff there is some kid in France who's working on the same stuff we're working on. >> I see. >> [LAUGH] And then a few months later, in June, there was another conference in France where Geoff was a keynote speaker.
6:26
And he gave a talk on Boltzmann machines. Of course, he was working on the backprop paper.
6:32
And he gave this talk, and then there was 50 people around him who wanted to talk to him. And the first thing he said to the organizer is, do you know this guy, Yann LeCun? And it's because he had read my paper in the proceedings that was written in French. And he could sort of read French and he could see the math and he could figure out what sort of backprop, and so we had lunch together and that's how we became friends. >> I see, well. >> [LAUGH] >> So that's because multiple groups independently reinvented or invented backprop pretty much. >> Right, well, we realized that the whole idea with Chain Rule or what the optimal control people call the joint state method which is really the context in which backprop was really invented. This in context of optimal control back in the early 60s. This idea that you could use graded descent basically with multiple stages is what backprop really is and that popped up in various contexts at various times. And but I think the Rumelhart, Hinton, Williams paper is the one that popularized it. >> I see, yeah, no, cool, yeah. And then fast forward a few years, you wound up at AT&T Bell Labs, where you invented, among many things, the net, which we talk about in the course. And I remember when way back, I was a summer intern at AT&T Bell Labs, where I worked with Michael Kerns and a few others, and of hearing about your work even back then. So tell me more about your AT&T, the net, experience. >> Okay, so what happened is, I actually started working on convolutional net when I was A postdoc, University of Toronto, chief intern.
8:07
I did the first experiment, I wrote the code there, and I did the first experiments there that showed that, if you had a very small data set. The data set I was training on, there was no or anything like that back then. So I drew a bunch of characters with my mouse. I had an Amiga, a personal computer, which was the best computer ever. And I drew a bunch of characters and then used that. I did augmentation to kind of increase it, and then used that as a way to test performance. And I compared things like fully connected nets, locally connected nets without shared weights. And then shared weight networks. Which was basically the first comment. And that worked really well for relatively small data sets, could show that you get better performance and no over-training with conventional architecture. And when I go to Bell Labs in October 1988, the first thing I did was first, scale up the network, because we had faster computers a few months before I go to Bell Labs. My boss at the time, Larry Jackal, who became a department head of said we should order a computer for you before you come. Where do you want? I say well, here Toronto, there is which was the stuff. It'd be great if we had one. And they ordered one and I had one for myself. At University of Toronto it was one for the entire department, right? One just for me, right? And so Larry told me he said, you know at Bell Labs you don't get famous by saving money. >> [LAUGH] >> So that was awesome, and they had been working already for awhile on character recognition. They had this enormous data set called USDS that had 5,000 training samples. [LAUGH] And so immediately I trained a net, which was in the net one, basically. And trained it on this data set and got really good results, better results than the other methods. They had tried on it, and that other people had tried on it is that so that, we knew we had something fairly early on. This was within three months of me joining Bell Labs. And so that was the first version of commercial net where we had a convolution with stride, and we did not have separate and pulling layers. >> Mm-hm. >> So each convolution was actually directly. And the reason for this is that we just could not afford to have a convolution at every location. There was just too much computation. >> I see. >> [COUGH] So, the second version had a separate convolution and pulling the air in something.
10:43
I guess that's the one that's called one really. So we published a couple papers on this at competitions in Nips. And so, interesting story, did you ever talk to Nips about this work?
10:58
And Jeffrey Ton was in the audience, and then you know I came back to my seat, I was sitting next to him and he said, there's one bit of information in your talk which is that, if you do all the sensible things, it actually works. >> [LAUGH] >> Then that showed the after deadline of work went on to make history because it became widely adopted. These ideas became widely adopted for reading cheques and- >> Yeah, the bigger value adopted within AT&T but not very much outside. And I think it's a little difficult for me to really understand why, but the simple factor [INAUDIBLE]. So this was back in the late 80s, and there was no Internet. We had email, we had FTP, but there was no Internet, really. No two labs were using the same software or hardware platform, right? Some people are at some workstations, others had other machines, some people were using PCs or whatever. There was no such thing as Python or MATLAB or anything like that, right? People are writing their own code. I had spent a year and a half basically writing, me and when he was still a student. We're working together, and we spent a year and a half basically just writing a neural net simulator.
12:12
And at the time because there was no match-up with Python. You had to kind of write your own interpreter, right? To kind of control it. So we want our own list of interpreter. And so all the networks written in list using a numerical back hand. Very similar to what we have now with blocks that you can interconnect and instead of many differentiation and all that stuff that we;re familiar now, with torsion by torsion, tensile flow and all those things.
12:37
So then we developed a bunch of applications. We got together with a group of engineers.
12:46
Very smart people.
12:48
Some of them were like theoretical physicists who kind of turned engineer at the Bell Labs.
12:57
Chris Dodgers was one of them who then had to
13:01
distinguished career at Microsoft research afterwards. And Krieg Nolan. But keep on and we're collaborating with them to kind of make this technology practical. >> I see. >> And so together we developed this characterization systems. And that meant integrating, convolutional net with things like, similar to things like we now call CRFs for interpreting sequences of characters not just individual address. >> Yeah, right to the net paper had partially under neural network and partially under atomic machinery >> Right, to pull it together? >> Yeah, that's right. And so the first half on the paper is on convolutional nets, and the paper is mostly cited for that. And then the second half, very few people have read it, [LAUGH] and it's about sort of sequence level, discriminative running, and basically structure prediction with that normalization. So it's very similar to CRF, in fact. >> Fascinating >> You know with PTCRFS over the years. So that was very successful, except that the day we were
14:08
celebrating the deployment of that system in major bank,
14:13
we worked with this group that I was mentioning that was kind of doing the engineering of the whole system. And then another product group in a different part of the country that belonged to a subsidiary of AT&T called NCR. So this is the- >> [CROSSTALK] >> National Cash Register, right. They also build large ATM machines, and they build large check reading machines for banks. So they were the customers, if you want. They were using our check billing systems. And they had deployed it in a bank. I can't remember which bank it was. They deployed those, so there were ATM machines in a French book. So they could read the check you would deposit, and we were all at a fancy restaurant celebrating the department of this thing where, when the company announced that it was breaking itself up. So this was 1995 and AT&T announced that it was breaking itself up into two companies. So there was AT&T, and then there was Lucen Technologies, and NCR. So NCR was spun off, and Lucent Technologies was spun off. And the engineering group went with Lucent Technologies, and the product group, of course, went with NCR.
15:19
And the sad thing is that the AT&T lawyers in their infinite wisdom assigned the patents, there was a patent on covolutional net which is thankfully expired. >> I see [LAUGH]. >> [LAUGH] Expired in 2007. About ten years ago. And they signed patent to NCR, but there was nobody in NCR who actually knew even what a convolutional net was really. And so the patent was in the hands of people who had no idea what they had. And we were in a different company that now could not really develop the technology, and our engineering team was in a separate company, because we went with AT&T and engineering went with Lucent, and the product group went with NCR. So it was a little depressing [LAUGH]. >> So in addition to your early work, when your networks were Part, you kept persisting on neural networks even when there was some sort of winter for neural net. So what was like that? >> Well, so I persisted and didn't persist in some ways. I was always convinced that eventually those techniques would come back to the fore, and sort of people would figure out how to use them in practice, and it would be useful. So I always had that in the back of my mind. But in 1996, when AT&T broke itself up, and all of our work on character recognition, basically, was kind of broken up because the part of the group went in separate way, I was also promoted to department head, and I had to figure out what to work on. And this was the early days of the Internet, we're talking 1995. And I had the idea somehow that one big problem about the emergence of the Internet was going to be to bring all the knowledge that we had on paper to the digital world. And so I started, actually, a project called DjVu, D-J-V-U, which was to compress scanned documents, essentially, so they could be distributed over the Internet. And this project was really fun for a while, and had some success, although AT&T really didn't know what to do with it. >> Yeah, I remember that, really helping dissemination of online research papers. >> Yeah, that's right, exactly. And we scanned the entire proceedings of NIPS, and we made them available online- >> I see, I remember that. >> To kind of demonstrate how that worked. And we could compress high resolution pages to just a few kilobytes. >> So ConvNet, starting from some of your much earlier work has now come and pretty much taken over the field of computer vision, and starting to encroach significantly into even other fields. So just tell me about how you saw that whole process. >> [LAUGH] So to tell you how I thought this was going to happen early on. So first of all, I always believed that this was going to work. It required fast computers and lots of data, but I always believed, somehow, that this was going to be the right thing to do. What I thought, originally, when I was at Bell Labs, that there was going to be some sort of continuous progress along these directions as machines got more powerful. And we were even designing chips to run convolutional nets at Bell Labs, but now those are actually in hospital graph separately had two different chips for running convolutional nets really efficiently. And so we thought there was going to be a kind of a pick up of this, and kind of growing interest and sort of continuous progress for it. But in fact, because of the sort of interest for neural nets, sort of dying in the mid-90s, that didn't happen. So there was kind of a dark period of six or seven years between 1995 roughly and 2002 when basically nobody was working on this. In fact, there was a little bit of work. There was some work at Microsoft in the early 2000s on using convolutional nets for Chinese character recognition.
19:08

Group, yeah, exactly. And there was some other small work for face detection and things like this in France, and in various other places, but it was very small. I discovered actually recently that a couple groups that came up with ideas that are essentially very similar to convolutional nets, but never quite published it the same way for medical image analysis. And those were mostly in the context of commercial systems. And so it never quite made it to the population. I mean, it was after our first work on convolutional nets, and they were not really aware of it, but it sort of developed in parallel a little bit. So several people got kind of similar ideas several years interval. But then I was really surprised by how fast interest picked up after the ImageNet- >> 2012 >> In 2012, so it's the end of 2012. It was kind of a very interesting event at ECCV, in Florence, where there was a workshop on ImageNet. And they already knew that had won by a large margin. And so everybody was waiting for talk. And most people in the computer vision community had no idea what a convolutional net was. I mean, they heard me talk about it. I actually had an invited talk at CVPR in 2000 where I talked about it, but most people had not paid much attention to it. Senior people did, they knew what it was, but the more junior people in the community were really, had no idea what it was. And so just gives his talk, and he doesn't explain what a convolutional net is because he assumes everybody knows, right? because he comes from a so he says, here's how everything is connected, and how we transform the data and what results we get. Again, assuming that everybody knows what it is. And a lot of people are incredibly surprised. And you could see the opinion of people changing as he was kind of giving his talk, very senior people in the field. >> So you think that workshop was a defining moment that swayed a lot of the computer vision community. >> Yeah, definitely. >> That's right, yeah. >> That's the way it happened, yeah, right there. >> So today, you retain a faculty position at NYU, and you also lead FAIR, Facebook AI Research. I know you have a pretty unique point of view on how corporate research should be done. Do you want to share your thoughts on that? >> Yeah, so I mean, one of the beautiful things that I managed to do at Facebook in the last four years is that I was given a lot of freedom to setup FAIR the way I thought was the most appropriate, because this was the first research organization within Facebook. Facebook is a sort of engineering-centric company. And so far was really focused on sort of survival or short-term things. And Facebook was about to turn ten years old, had a successful IPO. And was basically thinking about the next ten years, right? I mean, Mark Zuckerberg was thinking, what is going to be important for the next ten years? And the survival of the company was not in question anymore. So this is the kind of transition where a large company can start to think, or it was not such a large company at the time. Facebook had 5,000 employees or so, but it had the luxury to think about the next ten years and what would be important in technology. And Mark and his team decided that AI was going to be a crucial piece of technology for connecting people, which is the mission of Facebook. And so they explored several ways to kind of build an effort in AI. They had a small internal group, engineering group, experimenting with convolutional nets and stuff that were getting really good results in face recognition and various other things, which peaked their interest. And they explored the idea of hiring a bunch of young researchers, or acquiring a company, or things like this. And they settled on the idea of hiring someone senior in the field, and then kind of setting up a research organization.
23:20
And it was a bit of a culture shock, initially, because the way research operates in the company is very different from engineering, right? You have longer time scales and horizon. And researchers tend to be very conservative about the choice of places where they want to work. And I made it very clear very early on that research needs to be open, that researchers need to not only be encouraged to publish, but be even mandated to publish. And also be evaluated on criteria that are similar to what we used to evaluate academic researchers. [COUGH] And so what Mark and Mike Schroepfer, the CTO of the company, who is my boss now, said, they said, Facebook is a very open company. We distribute a lot of stuff in open source.
24:13
Schroepfer, the CTO, comes from the open source world. >> Mozilla. >> He was from Mozilla before that, and a lot of people came from that world. So that was in the DNA of the company, so that made me very confident that we could kind of set up an open research organization. And then the fact that the company is not obsessively compulsive about IP as some other companies are makes it much easier to collaborate with universities and have arrangements by which a person can have a foot in industry and a foot in academia. >> And you find that valuable, yourself? >> Absolutely, yes. Yeah, so if you look at my publications over the last four years, the vast majority of them are publications with my students at NYU. >> I see. >> Because at Facebook, I did a lot of organizing the lab, hiring, set the direction and advising, and things like this. But I don't get involved in individual research projects to get my name on papers. And I don't care to get my name on papers anymore, but it's- >> It's not sending out someone else to do your dirty work rather than doing all the dirty work yourself. >> Exactly, and you never want to put yourself, you want to stay behind the scene. You don't want to put yourself in competition with people in your lab in that case. >> I'm sure you get asked this a lot but hoping you answer for all the people watching this video as well.
25:36
What advice do you have for someone wanting to get involved in the, break into AI? >> [LAUGH] I mean, it's such a different world now than when it was when I got started. But I think what's great now is it's very easy for people to get involved at some level, the tools that are available are so easy to use now, in terms of whatever. You can have a run through on the cheap computer in your bedroom, [LAUGH] and basically train your conventional net or your current net to do whatever, and there's a lot of tools. You can learn a lot from online material about this without, it's not very onerous. So you see high school students now playing with this right? Which is kind of great, I think and they certainly are growing interest from the student population to learn about machine learning and AI and it's very exciting for young people and I find that wonderful I think. So my advice is, if you want to get into this, make yourself useful. So make a contribution to an open source project, for example. Or make an implementation of some standard algorithm that you can't find the code of online, but you'd like to make it available to other people. So take a paper that you think is important, and then re-implement the algorithm, and then put it open source package, or contribute to one of those open source packages. And if the stuff you write is interesting and useful, you'll get noticed. Maybe you'll get a nice job at a company you really wanted a job at, or maybe you'll get accepted in your favorite PhD program or things like this. So I think that's a good way to get started. >> So open source contributions is a good way to enter the community, give back to learn. >> Yeah, that's right, that's right. >> Thanks a lot Jan that was fascinating, I've known you for many years and it's still fascinating to hear all these details of all the stories that have gone in over the years. >> Yeah, there's many stories like this that, reflecting back at the moment when they happen you don't realize, what importance it might take 10 or 20 years later. >> Yeah, thank you. >> Thanks.

Deep convolutional models: case studies

Learn about the practical tricks and methods used in deep CNNs straight from the research papers.

Why look at case studies? - 3m

Hello and welcome back. This week the first thing we'll do is show you a number of case studies of effective convolutional neural networks. So why look at case studies? Last week we learned about the basic building blocks such as convolutional layers, pooling layers and fully connected layers of conv nets. It turns out a lot of the past few years of computer vision research has been on how to put together these basic building blocks to form effective convolutional neural networks. And one of the best ways for you to get intuition yourself is to see some of these examples. I think just as many of you may have learned to write codes by reading other people's codes, I think that a good way to get intuition on how to build conv nets is to read or to see other examples of effective conv nets. And it turns out that a neural network architecture that works well on one computer vision task often works well on other tasks as well such as maybe on your task. So if someone else is training neural network as speak it out in your network architecture is very good at recognizing cats and dogs and people but you have a different computer vision task like maybe you're trying to build a self-driving car. You might well be able to take someone else's neural network architecture and apply that to your problem. And finally, after the next few videos, you'll be able to read some of the research papers from the theater computer vision and I hope that you might find it satisfying as well. You don't have to do this as a class but I hope you might find it satisfying to be able to read some of these seminal computer vision research paper and see yourself able to understand them. So with that, let's get started. As an outline of what we'll do in the next few videos, we'll first show you a few classic networks. The LeNet-5 network which came from, I guess, in 1980s, AlexNet which is often cited and the VGG network and these are examples of pretty effective neural networks. And some of the ideas lay the foundation for modern computer vision. And you see ideas in these papers that are probably useful for your own. And you see ideas from these papers that were probably be useful for your own work as well. Then I want to show you the ResNet or conv residual network and you might have heard that neural networks are getting deeper and deeper. The ResNet neural network trained a very, very deep 152-layer neural network that has some very interesting tricks, interesting ideas how to do that effectively. And then finally you also see a case study of the Inception neural network. After seeing these neural networks, l think you have much better intuition about how to built effective convolutional neural networks. And even if you end up not working computer vision yourself, I think you find a lot of the ideas from some of these examples, such as ResNet Inception network, many of these ideas are cross-fertilizing on making their way into other disciplines. So even if you don't end up building computer vision applications yourself, I think you'll find some of these ideas very interesting and helpful for your work.

Classic Networks - 18m

0:00
In this video, you'll learn about some of the classic neural network architecture starting with LeNet-5, and then AlexNet, and then VGGNet. Let's take a look. Here is the LeNet-5 architecture. You start off with an image which say, 32 by 32 by 1. And the goal of LeNet-5 was to recognize handwritten digits, so maybe an image of a digits like that. And LeNet-5 was trained on grayscale images, which is why it's 32 by 32 by 1. This neural network architecture is actually quite similar to the last example you saw last week. In the first step, you use a set of six, 5 by 5 filters with a stride of one because you use six filters you end up with a 20 by 20 by 6 over there. And with a stride of one and no padding, the image dimensions reduces from 32 by 32 down to 28 by 28. Then the LeNet neural network applies pooling. And back then when this paper was written, people use average pooling much more. If you're building a modern variant, you probably use max pooling instead. But in this example, you average pool and with a filter width two and a stride of two, you wind up reducing the dimensions, the height and width by a factor of two, so we now end up with a 14 by 14 by 6 volume. I guess the height and width of these volumes aren't entirely drawn to scale. Now technically, if I were drawing these volumes to scale, the height and width would be stronger by a factor of two. Next, you apply another convolutional layer. This time you use a set of 16 filters, the 5 by 5, so you end up with 16 channels to the next volume. And back when this paper was written in 1998, people didn't really use padding or you always using valid convolutions, which is why every time you apply convolutional layer, they heightened with strengths. So that's why, here, you go from 14 to 14 down to 10 by 10. Then another pooling layer, so that reduces the height and width by a factor of two, then you end up with 5 by 5 over here. And if you multiply all these numbers 5 by 5 by 16, this multiplies up to 400. That's 25 times 16 is 400. And the next layer is then a fully connected layer that fully connects each of these 400 nodes with every one of 120 neurons, so there's a fully connected layer. And sometimes, that would draw out exclusively a layer with 400 nodes, I'm skipping that here. There's a fully connected layer and then another a fully connected layer. And then the final step is it uses these essentially 84 features and uses it with one final output. I guess you could draw one more node here to make a prediction for ŷ. And ŷ took on 10 possible values corresponding to recognising each of the digits from 0 to 9. A modern version of this neural network, we'll use a softmax layer with a 10 way classification output. Although back then, LeNet-5 actually use a different classifier at the output layer, one that's useless today. So this neural network was small by modern standards, had about 60,000 parameters. And today, you often see neural networks with anywhere from 10 million to 100 million parameters, and it's not unusual to see networks that are literally about a thousand times bigger than this network. But one thing you do see is that as you go deeper in a network, so as you go from left to right, the height and width tend to go down. So you went from 32 by 32, to 28 to 14, to 10 to 5, whereas the number of channels does increase. It goes from 1 to 6 to 16 as you go deeper into the layers of the network. One other pattern you see in this neural network that's still often repeated today is that you might have some one or more conu layers followed by pooling layer, and then one or sometimes more than one conu layer followed by a pooling layer, and then some fully connected layers and then the outputs. So this type of arrangement of layers is quite common. Now finally, this is maybe only for those of you that want to try reading the paper. There are a couple other things that were different. The rest of this slide, I'm going to make a few more advanced comments, only for those of you that want to try to read this classic paper. And so, everything I'm going to write in red, you can safely skip on the slide, and there's maybe an interesting historical footnote that is okay if you don't follow fully. So it turns out that if you read the original paper, back then, people used sigmoid and tanh nonlinearities, and people weren't using value nonlinearities back then. So if you look at the paper, you see sigmoid and tanh referred to. And there are also some funny ways about this network was wired that is funny by modern standards. So for example, you've seen how if you have a nh by nw by nc network with nc channels then you use f by f by nc dimensional filter, where everything looks at every one of these channels. But back then, computers were much slower. And so to save on computation as well as some parameters, the original LeNet-5 had some crazy complicated way where different filters would look at different channels of the input block. And so the paper talks about those details, but the more modern implementation wouldn't have that type of complexity these days. And then one last thing that was done back then I guess but isn't really done right now is that the original LeNet-5 had a non-linearity after pooling, and I think it actually uses sigmoid non-linearity after the pooling layer. So if you do read this paper, and this is one of the harder ones to read than the ones we'll go over in the next few videos, the next one might be an easy one to start with. Most of the ideas on the slide I just tried in sections two and three of the paper, and later sections of the paper talked about some other ideas. It talked about something called the graph transformer network, which isn't widely used today. So if you do try to read this paper, I recommend focusing really on section two which talks about this architecture, and maybe take a quick look at section three which has a bunch of experiments and results, which is pretty interesting. The second example of a neural network I want to show you is AlexNet, named after Alex Krizhevsky, who was the first author of the paper describing this work. The other author's were Ilya Sutskever and Geoffrey Hinton. So, AlexNet input starts with 227 by 227 by 3 images. And if you read the paper, the paper refers to 224 by 224 by 3 images. But if you look at the numbers, I think that the numbers make sense only of actually 227 by 227. And then the first layer applies a set of 96 11 by 11 filters with a stride of four. And because it uses a large stride of four, the dimensions shrinks to 55 by 55. So roughly, going down by a factor of 4 because of a large stride. And then it applies max pooling with a 3 by 3 filter. So f equals three and a stride of two. So this reduces the volume to 27 by 27 by 96, and then it performs a 5 by 5 same convolution, same padding, so you end up with 27 by 27 by 276. Max pooling again, this reduces the height and width to 13. And then another same convolution, so same padding. So it's 13 by 13 by now 384 filters. And then 3 by 3, same convolution again, gives you that. Then 3 by 3, same convolution, gives you that. Then max pool, brings it down to 6 by 6 by 256. If you multiply all these numbers,6 times 6 times 256, that's 9216. So we're going to unroll this into 9216 nodes. And then finally, it has a few fully connected layers. And then finally, it uses a softmax to output which one of 1000 causes the object could be. So this neural network actually had a lot of similarities to LeNet, but it was much bigger. So whereas the LeNet-5 from previous slide had about 60,000 parameters, this AlexNet that had about 60 million parameters. And the fact that they could take pretty similar basic building blocks that have a lot more hidden units and training on a lot more data, they trained on the image that dataset that allowed it to have a just remarkable performance. Another aspect of this architecture that made it much better than LeNet was using the value activation function. And then again, just if you read the bay paper some more advanced details that you don't really need to worry about if you don't read the paper, one is that, when this paper was written, GPUs was still a little bit slower, so it had a complicated way of training on two GPUs. And the basic idea was that, a lot of these layers was actually split across two different GPUs and there was a thoughtful way for when the two GPUs would communicate with each other. And the paper also, the original AlexNet architecture also had another set of a layer called a Local Response Normalization. And this type of layer isn't really used much, which is why I didn't talk about it. But the basic idea of Local Response Normalization is, if you look at one of these blocks, one of these volumes that we have on top, let's say for the sake of argument, this one, 13 by 13 by 256, what Local Response Normalization, (LRN) does, is you look at one position. So one position height and width, and look down this across all the channels, look at all 256 numbers and normalize them. And the motivation for this Local Response Normalization was that for each position in this 13 by 13 image, maybe you don't want too many neurons with a very high activation. But subsequently, many researchers have found that this doesn't help that much so this is one of those ideas I guess I'm drawing in red because it's less important for you to understand this one. And in practice, I don't really use local response normalizations really in the networks language trained today. So if you are interested in the history of deep learning, I think even before AlexNet, deep learning was starting to gain traction in speech recognition and a few other areas, but it was really just paper that convinced a lot of the computer vision community to take a serious look at deep learning to convince them that deep learning really works in computer vision. And then it grew on to have a huge impact not just in computer vision but beyond computer vision as well. And if you want to try reading some of these papers yourself and you really don't have to for this course, but if you want to try reading some of these papers, this one is one of the easier ones to read so this might be a good one to take a look at. So whereas AlexNet had a relatively complicated architecture, there's just a lot of hyperparameters, right? Where you have all these numbers that Alex Krizhevsky and his co-authors had to come up with. Let me show you a third and final example on this video called the VGG or VGG-16 network. And a remarkable thing about the VGG-16 net is that they said, instead of having so many hyperparameters, let's use a much simpler network where you focus on just having conv-layers that are just three-by-three filters with a stride of one and always use same padding. And make all your max pulling layers two-by-two with a stride of two. And so, one very nice thing about the VGG network was it really simplified this neural network architectures. So, let's go through the architecture. So, you solve up with an image for them and then the first two layers are convolutions, which are therefore these three-by-three filters. And in the first two layers use 64 filters. You end up with a 224 by 224 because using same convolutions and then with 64 channels. So because VGG-16 is a relatively deep network, am going to not draw all the volumes here. So what this little picture denotes is what we would previously have drawn as this 224 by 224 by 3. And then a convolution that results in I guess a 224 by 224 by 64 is to be drawn as a deeper volume, and then another layer that results in 224 by 224 by 64. So this conv64 times two represents that you're doing two conv-layers with 64 filters. And as I mentioned earlier, the filters are always three-by-three with a stride of one and they are always same convolutions. So rather than drawing all these volumes, am just going to use text to represent this network. Next, then uses are pulling layer, so the pulling layer will reduce. I think it goes from 224 by 224 down to what? Right. Goes to 112 by 112 by 64. And then it has a couple more conv-layers. So this means it has 128 filters and because these are the same convolutions, let's see what is the new dimension. Right? It will be 112 by 112 by 128 and then pulling layer so you can figure out what's the new dimension of that. And now, three conv-layers with 256 filters to the pulling layer and then a few more conv-layers, pulling layer, more conv-layers, pulling layer. And then it takes this final 7 by 7 by 512 these in to fully connected layer, fully connected with four thousand ninety six units and then a softmax output one of a thousand classes. By the way, the 16 in the VGG-16 refers to the fact that this has 16 layers that have weights. And this is a pretty large network, this network has a total of about 138 million parameters. And that's pretty large even by modern standards. But the simplicity of the VGG-16 architecture made it quite appealing. You can tell his architecture is really quite uniform. There is a few conv-layers followed by a pulling layer, which reduces the height and width, right? So the pulling layers reduce the height and width. You have a few of them here. But then also, if you look at the number of filters in the conv-layers, here you have 64 filters and then you double to 128 double to 256 doubles to 512. And then I guess the authors thought 512 was big enough and did double on the game here. But this sort of roughly doubling on every step, or doubling through every stack of conv-layers was another simple principle used to design the architecture of this network. And so I think the relative uniformity of this architecture made it quite attractive to researchers. The main downside was that it was a pretty large network in terms of the number of parameters you had to train. And if you read the literature, you sometimes see people talk about the VGG-19, that is an even bigger version of this network. And you could see the details in the paper cited at the bottom by Karen Simonyan and Andrew Zisserman. But because VGG-16 does almost as well as VGG-19. A lot of people will use VGG-16. But the thing I liked most about this was that, this made this pattern of how, as you go deeper and height and width goes down, it just goes down by a factor of two each time for the pulling layers whereas the number of channels increases. And here roughly goes up by a factor of two every time you have a new set of conv-layers. So by making the rate at which it goes down and that go up very systematic, I thought this paper was very attractive from that perspective. So that's it for the three classic architecture's. If you want, you should really now read some of these papers. I recommend starting with the AlexNet paper followed by the VGG net paper and then the LeNet paper is a bit harder to read but it is a good classic once you go over that. But next, let's go beyond these classic networks and look at some even more advanced, even more powerful neural network architectures. Let's go onto the next video.

ResNets - 7m

0:00
Very, very deep neural networks are difficult to train because of vanishing and exploding gradient types of problems. In this video, you'll learn about skip connections which allows you to take the activation from one layer and suddenly feed it to another layer even much deeper in the neural network. And using that, you'll build ResNet which enables you to train very, very deep networks. Sometimes even networks of over 100 layers. Let's take a look. ResNets are built out of something called a residual block, let's first describe what that is. Here are two layers of a neural network where you start off with some activations in layer a[l], then goes a[l+1] and then deactivation two layers later is a[l+2]. So let's to through the steps in this computation you have a[l], and then the first thing you do is you apply this linear operator to it, which is governed by this equation. So you go from a[l] to compute z[l +1] by multiplying by the weight matrix and adding that bias vector. After that, you apply the ReLU nonlinearity, to get a[l+1]. And that's governed by this equation where a[l+1] is g(z[l+1]). Then in the next layer, you apply this linear step again, so is governed by that equation. So this is quite similar to this equation we saw on the left. And then finally, you apply another ReLU operation which is now governed by that equation where G here would be the ReLU nonlinearity. And this gives you a[l+2]. So in other words, for information from a[l] to flow to a[l+2], it needs to go through all of these steps which I'm going to call the main path of this set of layers. In a residual net, we're going to make a change to this. We're going to take a[l], and just first forward it, copy it, match further into the neural network to here, and just at a[l], before applying to non-linearity, the ReLU non-linearity. And I'm going to call this the shortcut. So rather than needing to follow the main path, the information from a[l] can now follow a shortcut to go much deeper into the neural network. And what that means is that this last equation goes away and we instead have that the output a[l+2] is the ReLU non-linearity g applied to z[l+2] as before, but now plus a[l]. So, the addition of this a[l] here, it makes this a residual block. And in pictures, you can also modify this picture on top by drawing this picture shortcut to go here. And we are going to draw it as it going into this second layer here because the short cut is actually added before the ReLU non-linearity. So each of these nodes here, whwre there applies a linear function and a ReLU. So a[l] is being injected after the linear part but before the ReLU part. And sometimes instead of a term short cut, you also hear the term skip connection, and that refers to a[l] just skipping over a layer or kind of skipping over almost two layers in order to process information deeper into the neural network. So, what the inventors of ResNet, so that'll will be Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. What they found was that using residual blocks allows you to train much deeper neural networks. And the way you build a ResNet is by taking many of these residual blocks, blocks like these, and stacking them together to form a deep network. So, let's look at this network. This is not the residual network, this is called as a plain network. This is the terminology of the ResNet paper. To turn this into a ResNet, what you do is you add all those skip connections although those short like a connections like so. So every two layers ends up with that additional change that we saw on the previous slide to turn each of these into residual block. So this picture shows five residual blocks stacked together, and this is a residual network. And it turns out that if you use your standard optimization algorithm such as a gradient descent or one of the fancier optimization algorithms to the train or plain network. So without all the extra residual, without all the extra short cuts or skip connections I just drew in. Empirically, you find that as you increase the number of layers, the training error will tend to decrease after a while but then they'll tend to go back up. And in theory as you make a neural network deeper, it should only do better and better on the training set. Right. So, the theory, in theory, having a deeper network should only help. But in practice or in reality, having a plain network, so no ResNet, having a plain network that is very deep means that all your optimization algorithm just has a much harder time training. And so, in reality, your training error gets worse if you pick a network that's too deep. But what happens with ResNet is that even as the number of layers gets deeper, you can have the performance of the training error kind of keep on going down. Even if we train a network with over a hundred layers. And then now some people experimenting with networks of over a thousand layers although I don't see that it used much in practice yet. But by taking these activations be it X of these intermediate activations and allowing it to go much deeper in the neural network, this really helps with the vanishing and exploding gradient problems and allows you to train much deeper neural networks without really appreciable loss in performance, and maybe at some point, this will plateau, this will flatten out, and it doesn't help that much deeper and deeper networks. But ResNet is not even effective at helping train very deep networks. So you've now gotten an overview of how ResNets work. And in fact, in this week's programming exercise, you get to implement these ideas and see it work for yourself. But next, I want to share with you better intuition or even more intuition about why ResNets work so well, let's go onto the next video.

Why ResNets Work - 9m

0:00
So, why do ResNets work so well? Let's go through one example that illustrates why ResNets work so well, at least in the sense of how you can make them deeper and deeper without really hurting your ability to at least get them to do well on the training set. And hopefully as you've understood from the third course in this sequence, doing well on the training set is usually a prerequisite to doing well on your hold up or on your depth or on your test sets. So, being able to at least train ResNet to do well on the training set is a good first step toward that. Let's look at an example. What we saw on the last video was that if you make a network deeper, it can hurt your ability to train the network to do well on the training set. And that's why sometimes you don't want a network that is too deep. But this is not true or at least is much less true when you training a ResNet. So let's go through an example. Let's say you have X feeding in to some big neural network and just outputs some activation a[l]. Let's say for this example that you are going to modify the neural network to make it a little bit deeper. So, use the same big NN, and this output's a[l], and we're going to add a couple extra layers to this network so let's add one layer there and another layer there. And just for output a[l+2]. Only let's make this a ResNet block, a residual block with that extra short cut. And for the sake our argument, let's say throughout this network we're using the value activation functions. So, all the activations are going to be greater than or equal to zero, with the possible exception of the input X. Right. Because the value activation output's numbers that are either zero or positive. Now, let's look at what's a[l+2] will be. To copy the expression from the previous video, a[l+2] will be value apply to z[l+2], and then plus a[l] where is this addition of a[l] comes from the short circle from the skip connection that we just added. And if we expand this out, this is equal to g of w[l+2], times a of [l+1], plus b[l+2]. So that's z[l+2] is equal to that, plus a[l]. Now notice something, if you are using L two regularisation away to K, that will tend to shrink the value of w[l+2]. If you are applying way to K to B that will also shrink this although I guess in practice sometimes you do and sometimes you don't apply way to K to B, but W is really the key term to pay attention to here. And if w[l+2] is equal to zero. And let's say for the sake of argument that B is also equal to zero, then these terms go away because they're equal to zero, and then g of a[l], this is just equal to a[l] because we assumed we're using the value activation function. And so all of the activations are all negative and so, g of a[l] is the value applied to a non-negative quantity, so you just get back, a[l]. So, what this shows is that the identity function is easy for residual block to learn. And it's easy to get a[l+2] equals to a[l] because of this skip connection. And what that means is that adding these two layers in your neural network, it doesn't really hurt your neural network's ability to do as well as this simpler network without these two extra layers, because it's quite easy for it to learn the identity function to just copy a[l] to a[l+2] using despite the addition of these two layers. And this is why adding two extra layers, adding this residual block to somewhere in the middle or the end of this big neural network it doesn't hurt performance. But of course our goal is to not just not hurt performance, is to help performance and so you can imagine that if all of these heading units if they actually learned something useful then maybe you can do even better than learning the identity function. And what goes wrong in very deep plain nets in very deep network without this residual of the skip connections is that when you make the network deeper and deeper, it's actually very difficult for it to choose parameters that learn even the identity function which is why a lot of layers end up making your result worse rather than making your result better. And I think the main reason the residual network works is that it's so easy for these extra layers to learn the identity function that you're kind of guaranteed that it doesn't hurt performance and then a lot the time you maybe get lucky and then even helps performance. At least is easier to go from a decent baseline of not hurting performance and then great in decent can only improve the solution from there. So, one more detail in the residual network that's worth discussing which is through this edition here, we're assuming that z[l+2] and a[l] have the same dimension. And so what you see in ResNet is a lot of use of same convolutions so that the dimension of this is equal to the dimension I guess of this layer or the outputs layer. So that we can actually do this short circle connection, because the same convolution preserve dimensions, and so makes that easier for you to carry out this short circle and then carry out this addition of two equal dimension vectors. In case the input and output have different dimensions so for example, if this is a 128 dimensional and Z or therefore, a[l] is 256 dimensional as an example. What you would do is add an extra matrix and then call that Ws over here, and Ws in this example would be a[l] 256 by 128 dimensional matrix. So then Ws times a[l] becomes 256 dimensional and this addition is now between two 256 dimensional vectors and there are few things you could do with Ws, it could be a matrix of parameters we learned, it could be a fixed matrix that just implements zero paddings that takes a[l] and then zero pads it to be 256 dimensional and either of those versions I guess could work. So finally, let's take a look at ResNets on images. So these are images I got from the paper by Harlow. This is an example of a plain network and in which you input an image and then have a number of conv layers until eventually you have a softmax output at the end. To turn this into a ResNet, you add those extra skip connections. And I'll just mention a few details, there are a lot of three by three convolutions here and most of these are three by three same convolutions and that's why you're adding equal dimension feature vectors. So rather than a fully connected layer, these are actually convolutional layers but because the same convolutions, the dimensions are preserved and so the z[l+2] plus a[l] by addition makes sense. And similar to what you've seen in a lot of NetRes before, you have a bunch of convolutional layers and then there are occasionally pulling layers as well or pulling a pulling likely is. And whenever one of those things happen, then you need to make an adjustment to the dimension which we saw on the previous slide. You can do of the matrix Ws, and then as is common in these networks, you have pool, and then at the end you now have a fully connected layer that then makes a prediction using a softmax. So that's it for ResNet. Next, there's a very interesting idea behind using neural networks with one by one filters, one by one convolutions. So, one could use a one by one convolution. Let's take a look at the next video.

Networks in Networks and 1x1 Convolutions - 6m

0:00
In terms of designing content architectures, one of the ideas that really helps is using a one by one convolution. Now, you might be wondering, what does a one by one convolution do? Isn't that just multiplying by numbers? That seems like a funny thing to do. Turns out it's not quite like that. Let's take a look. So you'll see one by one filter, we'll put in number two there, and if you take the six by six image, six by six by one and convolve it with this one by one by one filter, you end up just taking the image and multiplying it by two. So, one, two, three ends up being two, four, six, and so on. And so, a convolution by a one by one filter, doesn't seem particularly useful. You just multiply it by some number. But that's the case of six by six by one channel images. If you have a 6 by 6 by 32 instead of by 1, then a convolution with a 1 by 1 filter can do something that makes much more sense. And in particular, what a one by one convolution will do is it will look at each of the 36 different positions here, and it will take the element wise product between 32 numbers on the left and 32 numbers in the filter. And then apply a ReLU non-linearity to it after that. So, to look at one of the 36 positions, maybe one slice through this value, you take these 36 numbers multiply it by 1 slice through the volume like that, and you end up with a single real number which then gets plotted in one of the outputs like that. And in fact, one way to think about the 32 numbers you have in this 1 by 1 by 32 filters is that, it's as if you have neuron that is taking as input, 32 numbers multiplying each of these 32 numbers in one slice of the same position heightened with by these 32 different channels, multiplying them by 32 weights and then applying a ReLU non-linearity to it and then outputting the corresponding thing over there. And more generally, if you have not just one filter, but if you have multiple filters, then it's as if you have not just one unit, but multiple units, taken as input all the numbers in one slice, and then building them up into an output of six by six by number of filters. So one way to think about the one by one convolution is that, it is basically having a fully connected neuron network, that applies to each of the 62 different positions. And what does fully connected neural network does? Is it puts 32 numbers and outputs number of filters outputs. So I guess the point on notation, this is really a nc(l+1), if that's the next layer. And by doing this at each of the 36 positions, each of the six by six positions, you end up with an output that is six by six by the number of filters. And this can carry out a pretty non-trivial computation on your input volume. And this idea is often called a one by one convolution but it's sometimes also called Network in Network, and is described in this paper, by Min Lin, Qiang Chen, and Schuicheng Yan. And even though the details of the architecture in this paper aren't used widely, this idea of a one by one convolution or this sometimes called Network in Network idea has been very influential, has influenced many other neural network architectures including the inception network which we'll see in the next video. But to give you an example of where one by one convolution is useful, here's something you could do with it. Let's say you have a 28 by 28 by 192 volume. If you want to shrink the height and width, you can use a pulling layer. So we know how to do that. But one of a number of channels has gotten too big and we want to shrink that. How do you shrink it to a 28 by 28 by 32 dimensional volume? Well, what you can do is use 32 filters that are one by one. And technically, each filter would be of dimension 1 by 1 by 192, because the number of channels in your filter has to match the number of channels in your input volume, but you use 32 filters and the output of this process will be a 28 by 28 by 32 volume. So this is a way to let you shrink nc as well, whereas pulling layers, I used just to shrink nH and nW, the height and width these volumes. And we'll see later how this idea of one by one convolutions allows you to shrink the number of channels and therefore, save on computation in some networks. But of course, if you want to keep the number of channels at 192, that's fine too. And the effect of the one by one convolution is it just adds non-linearity. It allows you to learn the more complex function of your network by adding another layer that inputs 28 by 28 by 192 and outputs 28 by 28 by 192. So, that's how a one by one convolutional layer is actually doing something pretty non-trivial and adds non-linearity to your neural network and allow you to decrease or keep the same or if you want, increase the number of channels in your volumes. Next, you'll see that this is actually very useful for building the inception network. Let's go on to that in the next video. So, you've now seen how a one by one convolution operation is actually doing a pretty non-trivial operation and it allows you to shrink the number of channels in your volumes or keep it the same or even increase it if you want. In the next video, you see that this can be used to help build up to the inception network. Let's go onto the next video.

Inception Network Motivation - 10m

0:00
When designing a layer for a ConvNet, you might have to pick, do you want a 1 by 3 filter, or 3 by 3, or 5 by 5, or do you want a pooling layer? What the inception network does is it says, why should you do them all? And this makes the network architecture more complicated, but it also works remarkably well. Let's see how this works. Let's say for the sake of example that you have inputted a 28 by 28 by 192 dimensional volume. So what the inception network or what an inception layer says is, instead choosing what filter size you want in a Conv layer, or even do you want a convolutional layer or a pooling layer? Let's do them all. So what if you can use a 1 by 1 convolution, and that will output a 28 by 28 by something. Let's say 28 by 28 by 64 output, and you just have a volume there. But maybe you also want to try a 3 by 3 and that might output a 20 by 20 by 128. And then what you do is just stack up this second volume next to the first volume. And to make the dimensions match up, let's make this a same convolution. So the output dimension is still 28 by 28, same as the input dimension in terms of height and width. But 28 by 28 by in this example 128. And maybe you might say well I want to hedge my bets. Maybe a 5 by 5 filter works better. So let's do that too and have that output a 28 by 28 by 32. And again you use the same convolution to keep the dimensions the same. And maybe you don't want to convolutional layer. Let's apply pooling, and that has some other output and let's stack that up as well. And here pooling outputs 28 by 28 by 32. Now in order to make all the dimensions match, you actually need to use padding for max pooling. So this is an unusual formal pooling because if you want the input to have a higher than 28 by 28 and have the output, you'll match the dimension everything else also by 28 by 28, then you need to use the same padding as well as a stride of one for pooling. So this detail might seem a bit funny to you now, but let's keep going. And we'll make this all work later. But with a inception module like this, you can input some volume and output. In this case I guess if you add up all these numbers, 32 plus 32 plus 128 plus 64, that's equal to 256. So you will have one inception module input 28 by 28 by 129, and output 28 by 28 by 256. And this is the heart of the inception network which is due to Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich. And the basic idea is that instead of you needing to pick one of these filter sizes or pooling you want and committing to that, you can do them all and just concatenate all the outputs, and let the network learn whatever parameters it wants to use, whatever the combinations of these filter sizes it wants. Now it turns out that there is a problem with the inception layer as we've described it here, which is computational cost. On the next slide, let's figure out what's the computational cost of this 5 by 5 filter resulting in this block over here. So just focusing on the 5 by 5 pot on the previous slide, we had as input a 28 by 28 by 192 block, and you implement a 5 by 5 same convolution of 32 filters to output 28 by 28 by 32. On the previous slide I had drawn this as a thin purple slide. So I'm just going draw this as a more normal looking blue block here. So let's look at the computational costs of outputting this 20 by 20 by 32. So you have 32 filters because the outputs has 32 channels, and each filter is going to be 5 by 5 by 192. And so the output size is 20 by 20 by 32, and so you need to compute 28 by 28 by 32 numbers. And for each of them you need to do these many multiplications, right? 5 by 5 by 192. So the total number of multiplies you need is the number of multiplies you need to compute each of the output values times the number of output values you need to compute. And if you multiply all of these numbers, this is equal to 120 million. And so, while you can do 120 million multiplies on the modern computer, this is still a pretty expensive operation. On the next slide you see how using the idea of 1 by 1 convolutions, which you learnt about in the previous video, you'll be able to reduce the computational costs by about a factor of 10. To go from about 120 million multiplies to about one tenth of that. So please remember the number 120 so you can compare it with what you see on the next slide, 120 million. Here is an alternative architecture for inputting 28 by 28 by 192, and outputting 28 by 28 by 32, which is falling. You are going to input the volume, use a 1 by 1 convolution to reduce the volume to 16 channels instead of 192 channels, and then on this much smaller volume, run your 5 by 5 convolution to give you your final output. So notice the input and output dimensions are still the same. You input 28 by 28 by 192 and output 28 by 28 by 32, same as the previous slide. But what we've done is we're taking this huge volume we had on the left, and we shrunk it to this much smaller intermediate volume, which only has 16 instead of 192 channels. Sometimes this is called a bottleneck layer, right? I guess because a bottleneck is usually the smallest part of something, right? So I guess if you have a glass bottle that looks like this, then you know this is I guess where the cork goes. And then the bottleneck is the smallest part of this bottle. So in the same way, the bottleneck layer is the smallest part of this network. We shrink the representation before increasing the size again. Now let's look at the computational costs involved. To apply this 1 by 1 convolution, we have 16 filters. Each of the filters is going to be of dimension 1 by 1 by 192, this 192 matches that 192. And so the cost of computing this 28 by 28 by 16 volumes is going to be well, you need these many outputs, and for each of them you need to do 192 multiplications. I could have written 1 times 1 times 192, right? Which is this. And if you multiply this out, this is 2.4 million, it's about 2.4 million. How about the second? So that's the cost of this first convolutional layer. The cost of this second convolutional layer would be that well, you have these many outputs. So 28 by 28 by 32. And then for each of the outputs you have to apply a 5 by 5 by 16 dimensional filter. And so by 5 by 5 by 16. And you multiply that out is equals to 10.0. And so the total number of multiplications you need to do is the sum of those which is 12.4 million multiplications. And you compare this with what we had on the previous slide, you reduce the computational cost from about 120 million multiplies, down to about one tenth of that, to 12.4 million multiplications. And the number of additions you need to do is about very similar to the number of multiplications you need to do. So that's why I'm just counting the number of multiplications. So to summarize, if you are building a layer of a neural network and you don't want to have to decide, do you want a 1 by 1, or 3 by 3, or 5 by 5, or pooling layer, the inception module let's you say let's do them all, and let's concatenate the results. And then we run to the problem of computational cost. And what you saw here was how using a 1 by 1 convolution, you can create this bottleneck layer thereby reducing the computational cost significantly. Now you might be wondering, does shrinking down the representation size so dramatically, does it hurt the performance of your neural network? It turns out that so long as you implement this bottleneck layer so that within reason, you can shrink down the representation size significantly, and it doesn't seem to hurt the performance, but saves you a lot of computation. So these are the key ideas of the inception module. Let's put them together and in the next video show you what the full inception network looks like.

Inception Network - 8m

0:00
In a previous video, you've already seen all the basic building blocks of the Inception network. In this video, let's see how you can put these building blocks together to build your own Inception network.
0:13
So the inception module takes as input the activation or the output from some previous layer. So let's say for the sake of argument this is 28 by 28 by 192, same as our previous video. The example we worked through in depth was the 1 by 1 followed by 5 by 5. There, so maybe the 1 by 1 has 16 channels and then the 5 by 5 will output a 28 by 28 by, let's say, 32 channels.
0:49
And this is the example we worked through on the last slide of the previous video.
0:54
Then to save computation on your 3 by 3 convolution you can also do the same here. And then the 3 by 3 outputs, 28 by 28 by 1 by 28.
1:09
And then maybe you want to consider a 1 by 1 convolution as well. There's no need to do a 1 by 1 conv followed by another 1 by 1 conv so there's just one step here and let's say these outputs 28 by 28 by 64. And then finally is the pulling layer.
1:34
So here I'm going to do something funny. In order to really concatenate all of these outputs at the end we are going to use the same type of padding for pooling. So that the output height and width is still 28 by 28. So we can concatenate it with these other outputs. But notice that if you do max-pooling, even with same padding, 3 by 3 filter is tried at 1. The output here will be 28 by 28, By 192. It will have the same number of channels and the same depth as the input that we had here. So, this seems like is has a lot of channels. So what we're going to do is actually add one more 1 by 1 conv layer to then to what we saw in the one by one convilational video, to strengthen the number of channels. So it gets us down to 28 by 28 by let's say, 32. And the way you do that, is to use 32 filters, of dimension 1 by 1 by 192. So that's why the output dimension has a number of channels shrunk down to 32. So then we don't end up with the pulling layer taking up all the channels in the final output.
3:02
And finally you take all of these blocks and you do channel concatenation. Just concatenate across this 64 plus 128 plus 32 plus 32 and this if you add it up this gives you a 28 by 28 by 256 dimension output. Concat is just this concatenating the blocks that we saw in the previous video.
3:33
So this is one inception module, and what the inception network does, is, more or less, put a lot of these modules together.
3:45
Here's a picture of the inception network, taken from the paper by Szegedy et al
3:53
And you notice a lot of repeated blocks in this. Maybe this picture looks really complicated. But if you look at one of the blocks there, that block is basically the inception module that you saw on the previous slide.
4:10
And subject to little details I won't discuss, this is another inception block. This is another inception block. There's some extra max pooling layers here to change the dimension
4:24
of the heightened width. But that's another inception block. And then there's another max put here to change the height and width but basically there's another inception block. But the inception network is just a lot of these blocks that you've learned about repeated to different positions of the network. But so you understand the inception block from the previous slide, then you understand the inception network.
4:49
It turns out that there's one last detail to the inception network if we read the optional research paper. Which is that there are these additional side-branches that I just added.
5:01
So what do they do? Well, the last few layers of the network is a fully connected layer followed by a softmax layer to try to make a prediction. What these side branches do is it takes some hidden layer and it tries to use that to make a prediction. So this is actually a softmax output and so is that. And this other side branch, again it is a hidden layer passes through a few layers like a few connected layers. And then has the softmax try to predict what's the output label.
5:35
And you should think of this as maybe just another detail of the inception that's worked. But what is does is it helps to ensure that the features computed. Even in the heading units, even at intermediate layers. That they're not too bad for protecting the output cause of a image. And this appears to have a regularizing effect on the inception network and helps prevent this network from overfitting.
6:03
And by the way, this particular Inception network was developed by authors at Google. Who called it GoogleNet, spelled like that, to pay homage to the network. That you learned about in an earlier video as well.
6:23
So I think it's actually really nice that the Deep Learning Community is so collaborative. And that there's such strong healthy respect for each other's' work in the Deep Learning Learning community. FInally here's one fun fact. Where does the name inception network come from?
6:41
The inception paper actually cites this meme for we need to go deeper. And this URL is an actual reference in the inception paper, which links to this image. And if you've seen the movie titled The Inception, maybe this meme will make sense to you. But the authors actually cite this meme as motivation for needing to build deeper new networks. And that's how they came up with the inception architecture. So I guess it's not often that research papers get to cite Internet memes in their citations. But in this case, I guess it worked out quite well.
7:23
So to summarize, if you understand the inception module, then you understand the inception network. Which is largely the inception module repeated a bunch of times throughout the network. Since the development of the original inception module, the author and others have built on it and come up with other versions as well. So there are research papers on newer versions of the inception algorithm. And you sometimes see people use some of these later versions as well in their work, like inception v2, inception v3, inception v4. There's also an inception version. This combined with the resonant idea of having skipped connections, and that sometimes works even better. But all of these variations are built on the basic idea that you learned about this in the previous video of coming up with the inception module and then stacking up a bunch of them together. And with these videos you should be able to read and understand, I think, the inception paper, as well as maybe some of the papers describing the later derivation as well. So that's it, you've gone through quite a lot of specialized neural network architectures. In the next video, I want to start showing you some more practical advice on how you actually use these algorithms to build your own computer vision system. Let's go on to the next video.

Using Open-Source Implementation - 4m

0:01
You've now learned about several highly effective neural network and ConvNet architectures. What I want to do in the next few videos is share with you some practical advice on how to use them, first starting with using open source implementations. It turns out that a lot of these neural networks are difficult or finicky to replicate because a lot of details about tuning of the hyperparameters such as learning decay and other things that make some difference to the performance. And so I've found that it's sometimes difficult even for, say, a higher deep loving PhD students, even at the top universities to replicate someone else's polished work just from reading their paper. Fortunately, a lot of deep learning researchers routinely open source their work on the Internet, such as on GitHub. And as you do work yourself, I certainly encourage you to consider contributing back your code to the open source community. But if you see a research paper whose results you would like to build on top of, one thing you should consider doing, one thing I do quite often it's just look online for an open source implementation. Because if you can get the author's implementation, you can usually get going much faster than if you would try to reimplement it from scratch. Although sometimes reimplementing from scratch could be a good exercise to do as well. If you're already familiar with how to use GitHub, this video might be less necessary or less important for you. But if you aren't used to downloading open-source code from GitHub, let me quickly show you how easy it is.
1:42
Let's say you're excited about residual networks, and you want to use it. So let's search for residence on GitHub.
1:50
And so you actually see a lot of different implementations of residence on GitHub. And I'm just going to go to the first URL here. And this is a GitHub repo that implements residence. Along with the GitHub webpages if you scroll down we'll have some text describing the work or the particular implementation. On this particular repo, this particular GitHub repository was actually by the original authors of the ResNet paper. And this code, this license under an MIT license, you can click through to take a look at the implications of this license. The MIT License is one of the more permissive or one of the more open open-source licenses. So I'm going to go ahead and download the code, and to do that, click on this link. This gives you the URL that you can use to download the code. I'm going to click on this button over here to copy the URL to my clipboard and then go over here. Then all you have to do is type git clone and then Ctrl+V for the URL and hit Enter. And so in a couples of seconds it has download, has cloned this repository to my local hard disk. So let's go into the directory and let's take a look. I'm more used in Mac than Windows, but I guess let's see, let's go to prototxt and I think this is where it has the files specifying the network. So let's take a look at this file, because this is a very long file that specifies the detail configurations of the ResNet with a 101 layers, all right? And it looks like from what I remember seeing from this webpage, this particular implementation uses the Cafe framework.
3:39
But if you wanted implementation of this code using some other programming framework, you might be able to find it as well.
3:48
So if you're developing a computer vision application, a very common workflow would be to pick an architecture that you like, maybe one of the ones you learned about in this course. Or maybe one that you heard about from a friend or from some literature. And look for an open source implementation and download it from GitHub to start building from there. One of the advantages of doing so also is that sometimes these networks take a long time to train, and someone else might have used multiple GPUs and a very large dataset to pretrain some of these networks. And that allows you to do transfer learning using these networks which we'll discuss in the next video as well. Of course if you're computer vision researcher implementing these things from scratch, then your workflow will be different. And if you do that, then do contribute your work back to the open source community. But because so many vision researchers have done so much work implementing these architectures, I found that often starting with open-source implementations is a better way, or certainly a faster way to get started on a new project.

Transfer Learning - 8m

0:00
If you're building a computer vision application rather than training the ways from scratch, from random initialization, you often make much faster progress if you download ways that someone else has already trained on the network architecture and use that as pre-training and transfer that to a new task that you might be interested in. The computer vision research community has been pretty good at posting lots of data sets on the Internet so if you hear of things like Image Net, or MS COCO, or Pascal types of data sets, these are the names of different data sets that people have post online and a lot of computer researchers have trained their algorithms on. Sometimes these training takes several weeks and might take many GP use and the fact that someone else has done this and gone through the painful high-performance search process, means that you can often download open source ways that took someone else many weeks or months to figure out and use that as a very good initialization for your own neural network. And use transfer learning to sort of transfer knowledge from some of these very large public data sets to your own problem. Let's take a deeper look at how to do this. Let's start with the example, let's say you're building a cat detector to recognize your own pet cat. According to the internet, Tigger is a common cat name and Misty is another common cat name. Let's say your cats are called Tiger and Misty and there's also neither. You have a classification problem with three clauses. Is this picture Tigger, or is it Misty, or is it neither. And in all the case of both of you cats appearing in the picture. Now, you probably don't have a lot of pictures of Tigger or Misty so your training set will be small. What can you do? I recommend you go online and download some open source implementation of a neural network and download not just the code but also the weights. There are a lot of networks you can download that have been trained on for example, the Init Net data sets which has a thousand different clauses so the network might have a softmax unit that outputs one of a thousand possible clauses. What you can do is then get rid of the softmax layer and create your own softmax unit that outputs Tigger or Misty or neither. In terms of the network, I'd encourage you to think of all of these layers as frozen so you freeze the parameters in all of these layers of the network and you would then just train the parameters associated with your softmax layer. Which is the softmax layer with three possible outputs, Tigger, Misty or neither. By using someone else's free trade ways, you might probably get pretty good performance on this even with a small data set. Fortunately, a lot of people learning frameworks support this mode of operation and in fact, depending on the framework it might have things like trainable parameter equals zero, you might set that for some of these early layers. In others they just say, don't train those ways or sometimes you have a parameter like freeze equals one and these are different ways and different deep learning program frameworks that let you specify whether or not to train the ways associated with particular layer. In this case, you will train only the softmax layers ways but freeze all of the earlier layers ways. One other neat trick that may help for some implementations is that because all of these early leads are frozen, there are some fixed function that doesn't change because you're not changing it, you not training it that takes this input image acts and maps it to some set of activations in that layer. One of the trick that could speed up training is we just pre-compute that layer, the features of re-activations from that layer and just save them to disk. What you're doing is using this fixed function, in this first part of the neural network, to take this input any image X and compute some feature vector for it and then you're training a shallow softmax model from this feature vector to make a prediction. One step that could help your computation as you just pre-compute that layers activation, for all the examples in training sets and save them to disk and then just train the softmax clause right on top of that. The advantage of the safety disk or the pre-compute method or the safety disk is that you don't need to recompute those activations everytime you take a epoch or take a post through a training set. This is what you do if you have a pretty small training set for your task. Whether you have a larger training set. One rule of thumb is if you have a larger label data set so maybe you just have a ton of pictures of Tigger, Misty as well as I guess pictures neither of them, one thing you could do is then freeze fewer layers. Maybe you freeze just these layers and then train these later layers. Although if the output layer has different clauses then you need to have your own output unit any way Tigger, Misty or neither. There are a couple of ways to do this. You could take the last few layers ways and just use that as initialization and do gradient descent from there or you can also blow away these last few layers and just use your own new hidden units and in your own final softmax outputs. Either of these matters could be worth trying. But maybe one pattern is if you have more data, the number of layers you've freeze could be smaller and then the number of layers you train on top could be greater. And the idea is that if you pick a data set and maybe have enough data not just to train a single softmax unit but to train some other size neural network that comprises the last few layers of this final network that you end up using. Finally, if you have a lot of data, one thing you might do is take this open source network and ways and use the whole thing just as initialization and train the whole network. Although again if this was a thousand of softmax and you have just three outputs, you need your own softmax output. The output of labels you care about. But the more label data you have for your task or the more pictures you have of Tigger, Misty and neither, the more layers you could train and in the extreme case, you could use the ways you download just as initialization so they would replace random initialization and then could do gradient descent, training updating all the ways and all the layers of the network. That's transfer learning for the training of ConvNets. In practice, because the open data sets on the internet are so big and the ways you can download that someone else has spent weeks training has learned from so much data, you find that for a lot of computer vision applications, you just do much better if you download someone else's open source ways and use that as initialization for your problem. In all the different disciplines, in all the different applications of deep learning, I think that computer vision is one where transfer learning is something that you should almost always do unless, you have an exceptionally large data set to train everything else from scratch yourself. But transfer learning is just very worth seriously considering unless you have an exceptionally large data set and a very large computation budget to train everything from scratch by yourself.

Data Augmentation - 9m

0:00
Most computer vision task could use more data. And so data augmentation is one of the techniques that is often used to improve the performance of computer vision systems. I think that computer vision is a pretty complicated task. You have to input this image, all these pixels and then figure out what is in this picture. And it seems like you need to learn the decently complicated function to do that. And in practice, there almost all competing visions task having more data will help. This is unlike some other domains where sometimes you can get enough data, they don't feel as much pressure to get even more data. But I think today, this data computer vision is that, for the majority of computer vision problems, we feel like we just can't get enough data. And this is not true for all applications of machine learning, but it does feel like it's true for computer vision. So, what that means is that when you're training in computer vision model, often data augmentation will help. And this is true whether you're using transfer learning or using someone else's pre-trained ways to start, or whether you're trying to train something yourself from scratch. Let's take a look at the common data augmentation that is in computer vision. Perhaps the simplest data augmentation method is mirroring on the vertical axis, where if you have this example in your training set, you flip it horizontally to get that image on the right. And for most computer vision task, if the left picture is a cat then mirroring it is though a cat. And if the mirroring operation preserves whatever you're trying to recognize in the picture, this would be a good data augmentation technique to use. Another commonly used technique is random cropping. So given this dataset, let's pick a few random crops. So you might pick that, and take that crop or you might take that, to that crop, take this, take that crop and so this gives you different examples to feed in your training sample, sort of different random crops of your datasets. So random cropping isn't a perfect data augmentation. What if you randomly end up taking that crop which will look much like a cat but in practice and worthwhile so long as your random crops are reasonably large subsets of the actual image. So, mirroring and random cropping are frequently used and in theory, you could also use things like rotation, shearing of the image, so that's if you do this to the image, distort it that way, introduce various forms of local warping and so on. And there's really no harm with trying all of these things as well, although in practice they seem to be used a bit less, or perhaps because of their complexity. The second type of data augmentation that is commonly used is color shifting. So, given a picture like this, let's say you add to the R, G and B channels different distortions. In this example, we are adding to the red and blue channels and subtracting from the green channel. So, red and blue make purple. So, this makes the whole image a bit more purpley and that creates a distorted image for training set. For illustration purposes, I'm making somewhat dramatic changes to the colors and practice, you draw R, G and B from some distribution that could be quite small as well. But what you do is take different values of R, G, and B and use them to distort the color channels. So, in the second example, we are making a less red, and more green and more blue, so that turns our image a bit more yellowish. And here, we are making it much more blue, just a tiny little bit longer. But in practice, the values R, G and B, are drawn from some probability distribution. And the motivation for this is that if maybe the sunlight was a bit yellow or maybe the in-goal illumination was a bit more yellow, that could easily change the color of an image, but the identity of the cat or the identity of the content, the label y, just still stay the same. And so introducing these color distortions or by doing color shifting, this makes your learning algorithm more robust to changes in the colors of your images. Just a comment for the advanced learners in this course, that is okay if you don't understand what I'm about to say when using red. There are different ways to sample R, G, and B. One of the ways to implement color distortion uses an algorithm called PCA. This is called Principles Component Analysis, which I talked about in the ml-class.org Machine Learning Course on Coursera. But the details of this are actually given in the AlexNet paper, and sometimes called PCA Color Augmentation. But the rough idea at the time PCA Color Augmentation is for example, if your image is mainly purple, if it mainly has red and blue tints, and very little green, then PCA Color Augmentation, will add and subtract a lot to red and blue, where it balance [inaudible] all the greens, so kind of keeps the overall color of the tint the same. If you didn't understand any of this, don't worry about it. But if you can search online for that, you can and if you want to read about the details of it in the AlexNet paper, and you can also find some open source implementations of the PCA Color Augmentation, and just use that. So, you might have your training data stored in a hard disk and uses symbol, this round bucket symbol to represent your hard disk. And if you have a small training set, you can do almost anything and you'll be okay. But the very last training set and this is how people will often implement it, which is you might have a CPU thread that is constantly loading images of your hard disk. So, you have this stream of images coming in from your hard disk. And what you can do is use maybe a CPU thread to implement the distortions, yet the random cropping, or the color shifting, or the mirroring, but for each image, you might then end up with some distorted version of it. So, let's see this image, I'm going to mirror it and if you also implement colors distortion and so on. And if this image ends up being color shifted, so you end up with some different colored cat. And so your CPU thread is constantly loading data as well as implementing whether the distortions are needed to form a batch or really many batches of data. And this data is then constantly passed to some other thread or some other process for implementing training and this could be done on the CPU or really increasingly on the GPU if you have a large neural network to train. And so, a pretty common way of implementing data augmentation is to really have one thread, almost four threads, that is responsible for loading the data and implementing distortions, and then passing that to some other thread or some other process that then does the training. And often, this and this, can run in parallel. So, that's it for data augmentation. And similar to other parts of training a deep neural network, the data augmentation process also has a few hyperparameters such as how much color shifting do you implement and exactly what parameters you use for random cropping? So, similar to elsewhere in computer vision, a good place to get started might be to use someone else's open source implementation for how they use data augmentation. But of course, if you want to capture more in variances, then you think someone else's open source implementation isn't, it might be reasonable also to use hyperparameters yourself. So with that, I hope that you're going to use data augmentation, to get your computer vision applications to work better.

State of Computer Vision - 12m

0:00
Deep learning has been successfully applied to computer vision, natural language processing, speech recognition, online advertising, logistics, many, many, many problems. There are a few things that are unique about the application of deep learning to computer vision, about the status of computer vision. In this video, I will share with you some of my observations about deep learning for computer vision and I hope that that will help you better navigate the literature, and the set of ideas out there, and how you build these systems yourself for computer vision. So, you can think of most machine learning problems as falling somewhere on the spectrum between where you have relatively little data to where you have lots of data. So, for example I think that today we have a decent amount of data for speech recognition and it's relative to the complexity of the problem. And even though there are reasonably large data sets today for image recognition or image classification, because image recognition is just a complicated problem to look at all those pixels and figure out what it is. It feels like even though the online data sets are quite big like over a million images, feels like we still wish we had more data. And there are some problems like object detection where we have even less data. So, just as a reminder image recognition was the problem of looking at a picture and telling you is this a cattle or not. Whereas object detection is look in the picture and actually you're putting the bounding boxes are telling you where in the picture the objects such as the car as well. And so because of the cost of getting the bounding boxes is just more expensive to label the objects and the bounding boxes. So, we tend to have less data for object detection than for image recognition. And object detection is something we'll discuss next week. So, if you look across a broad spectrum of machine learning problems, you see on average that when you have a lot of data you tend to find people getting away with using simpler algorithms as well as less hand-engineering. So, there's just less needing to carefully design features for the problem, but instead you can have a giant neural network, even a simpler architecture, and have a neural network. Just learn whether we want to learn we have a lot of data. Whereas, in contrast when you don't have that much data then on average you see people engaging in more hand-engineering. And if you want to be ungenerous you can say there are more hacks. But I think when you don't have much data then hand-engineering is actually the best way to get good performance. So, when I look at machine learning applications I think usually we have the learning algorithm has two sources of knowledge. One source of knowledge is the labeled data, really the (x,y) pairs you use for supervised learning. And the second source of knowledge is the hand-engineering. And there are lots of ways to hand-engineer a system. It can be from carefully hand designing the features, to carefully hand designing the network architectures to maybe other components of your system. And so when you don't have much labeled data you just have to call more on hand-engineering. And so I think computer vision is trying to learn a really complex function. And it often feels like we don't have enough data for computer vision. Even though data sets are getting bigger and bigger, often we just don't have as much data as we need. And this is why this data computer vision historically and even today has relied more on hand-engineering. And I think this is also why that either computer vision has developed rather complex network architectures, is because in the absence of more data the way to get good performance is to spend more time architecting, or fooling around with the network architecture. And in case you think I'm being derogatory of hand-engineering that's not at all my intent. When you don't have enough data hand-engineering is a very difficult, very skillful task that requires a lot of insight. And someone that is insightful with hand-engineering will get better performance, and is a great contribution to a project to do that hand-engineering when you don't have enough data. It's just when you have lots of data then I wouldn't spend time hand-engineering, I would spend time building up the learning system instead. But I think historically the fear the computer vision has used very small data sets, and so historically the computer vision literature has relied on a lot of hand-engineering. And even though in the last few years the amount of data with the right computer vision task has increased dramatically, I think that that has resulted in a significant reduction in the amount of hand-engineering that's being done. But there's still a lot of hand-engineering of network architectures and computer vision. Which is why you see very complicated hyper frantic choices in computer vision, are more complex than you do in a lot of other disciplines. And in fact, because you usually have smaller object detection data sets than image recognition data sets, when we talk about object detection that is task like this next week. You see that the algorithms become even more complex and has even more specialized components. Fortunately, one thing that helps a lot when you have little data is transfer learning. And I would say for the example from the previous slide of the tigger, misty, neither detection problem, you have soluble data that transfer learning will help a lot. And so that's another set of techniques that's used a lot for when you have relatively little data. If you look at the computer vision literature, and look at the sort of ideas out there, you also find that people are really enthusiastic. They're really into doing well on standardized benchmark data sets and on winning competitions. And for computer vision researchers if you do well and the benchmark is easier to get the paper published. So, there's just a lot of attention on doing well on these benchmarks. And the positive side of this is that, it helps the whole community figure out what are the most effective algorithms. But you also see in the papers people do things that allow you to do well on a benchmark, but that you wouldn't really use in a production or a system that you deploy in an actual application. So, here are a few tips on doing well on benchmarks. These are things that I don't myself pretty much ever use if I'm putting a system to production that is actually to serve customers. But one is ensembling. And what that means is, after you've figured out what neural network you want, train several neural networks independently and average their outputs. So, initialize say 3, or 5, or 7 neural networks randomly and train up all of these neural networks, and then average their outputs. And by the way it is important to average their outputs y hats. Don't average their weights that won't work. Look and you say seven neural networks that have seven different predictions and average that. And this will cause you to do maybe 1% better, or 2% better. So is a little bit better on some benchmark. And this will cause you to do a little bit better. Maybe sometimes as much as 1 or 2% which really help win a competition. But because ensembling means that to test on each image, you might need to run an image through anywhere from say 3 to 15 different networks quite typical. This slows down your running time by a factor of 3 to 15, or sometimes even more. And so ensembling is one of those tips that people use doing well in benchmarks and for winning competitions. But that I think is almost never use in production to serve actual customers. I guess unless you have huge computational budget and don't mind burning a lot more of it per customer image. Another thing you see in papers that really helps on benchmarks, is multi-crop at test time. So, what I mean by that is you've seen how you can do data augmentation. And multi-crop is a form of applying data augmentation to your test image as well. So, for example let's see a cat image and just copy it four times including two more versions. There's a technique called the 10-crop, which basically says let's say you take this central region that crop, and run it through your crossfire. And then take that crop up the left hand corner run through a crossfire, up right hand corner shown in green, lower left shown in yellow, lower right shown in orange, and run that through the crossfire. And then do the same thing with the mirrored image. Right. So I'll take the central crop, then take the four corners crops. So, that's one central crop here and here, there's four corners crop here and here. And if you add these up that's 10 different crops that you mentioned. So hence the name 10-crop. And so what you do, is you run these 10 images through your crossfire and then average the results. So, if you have the computational budget you could do this. Maybe you don't need as many as 10-crops, you can use a few crops. And this might get you a little bit better performance in a production system. By production I mean a system you're deploying for actual users. But this is another technique that is used much more for doing well on benchmarks than in actual production systems. And one of the big problems of ensembling is that you need to keep all these different networks around. And so that just takes up a lot more computer memory. For multi-crop I guess at least you keep just one network around. So it doesn't suck up as much memory, but it still slows down your run time quite a bit. So, these are tips you see and research papers will refer to these tips as well. But I personally do not tend to use these methods when building production systems even though they are great for doing better on benchmarks and on winning competitions. Because a lot of the computer vision problems are in the small data regime, others have done a lot of hand-engineering of the network architectures. And a neural network that works well on one vision problem often may be surprisingly, but they just often would work on other vision problems as well. So, to build a practical system often you do well starting off with someone else's neural network architecture. And you can use an open source implementation if possible, because the open source implementation might have figured out all the finicky details like the learning rate, case scheduler, and other hyper parameters. And finally someone else may have spent weeks training a model on half a dozen GP use and on over a million images. And so by using someone else's pretrained model and fine tuning on your data set, you can often get going much faster on an application. But of course if you have the compute resources and the inclination, don't let me stop you from training your own networks from scratch. And in fact if you want to invent your own computer vision algorithm, that's what you might have to do. So, that's it for this week, I hope that seeing a number of computer vision architectures helps you get a sense of what works. In this week's programming exercises you actually learn another programming framework and use that to implement resonance. So, I hope you enjoy that programming exercise and I look forward to seeing you next week.

Object detection

Learn how to apply your knowledge of CNNs to one of the toughest but hottest field of computer vision: Object detection.

Object Localization - 11m

0:01
Hello and welcome back. This week you learn about object detection. This is one of the areas of computer vision that's just exploding and is working so much better than just a couple of years ago. In order to build up to object detection, you first learn about object localization. Let's start by defining what that means. You're already familiar with the image classification task where an algorithm looks at this picture and might be responsible for saying this is a car. So that was classification.
0:34
The problem you learn to build in your network to address later on this video is classification with localization. Which means not only do you have to label this as say a car but the algorithm also is responsible for putting a bounding box, or drawing a red rectangle around the position of the car in the image. So that's called the classification with localization problem. Where the term localization refers to figuring out where in the picture is the car you've detective. Later this week, you then learn about the detection problem where now there might be multiple objects in the picture and you have to detect them all and and localized them all. And if you're doing this for an autonomous driving application, then you might need to detect not just other cars, but maybe other pedestrians and motorcycles and maybe even other objects. So you'll see that later this week. So in the terminology we'll use this week, the classification and the classification of localization problems usually have one object. Usually one big object in the middle of the image that you're trying to recognize or recognize and localize. In contrast, in the detection problem there can be multiple objects. And in fact, maybe even multiple objects of different categories within a single image. So the ideas you've learned about for image classification will be useful for classification with localization. And that the ideas you learn for localization will then turn out to be useful for detection. So let's start by talking about classification with localization.
2:15
You're already familiar with the image classification problem, in which you might input a picture into a ConfNet with multiple layers so that's our ConfNet. And this results in a vector features that is fed to maybe a softmax unit that outputs the predicted clause. So if you are building a self driving car, maybe your object categories are the following. Where you might have a pedestrian, or a car, or a motorcycle, or a background. This means none of the above. So if there's no pedestrian, no car, no motorcycle, then you might have an output background. So these are your classes, they have a softmax with four possible outputs. So this is the standard classification pipeline. How about if you want to localize the car in the image as well. To do that, you can change your neural network to have a few more output units that output a bounding box. So, in particular, you can have the neural network output four more numbers, and I'm going to call them bx, by, bh, and bw. And these four numbers parameterized the bounding box of the detected object. So in these videos, I am going to use the notational convention that the upper left of the image, I'm going to denote as the coordinate (0,0), and at the lower right is (1,1). So, specifying the bounding box, the red rectangle requires specifying the midpoint. So that’s the point bx, by as well as the height, that would be bh, as well as the width, bw of this bounding box. So now if your training set contains not just the object cross label, which a neural network is trying to predict up here, but it also contains four additional numbers. Giving the bounding box then you can use supervised learning to make your algorithm outputs not just a class label but also the four parameters to tell you where is the bounding box of the object you detected. So in this example the ideal bx might be about 0.5 because this is about halfway to the right to the image. by might be about 0.7 since it's about maybe 70% to the way down to the image. bh might be about 0.3 because the height of this red square is about 30% of the overall height of the image. And bw might be about 0.4 let's say because the width of the red box is about 0.4 of the overall width of the entire image.
5:15
So let's formalize this a bit more in terms of how we define the target label y for this as a supervised learning task. So just as a reminder these are our four classes, and the neural network now outputs those four numbers as well as a class label,
5:36
or maybe probabilities of the class labels.
5:40
So, let's define the target label y as follows. Is going to be a vector where the first component pc is going to be, is there an object?
5:55
So, if the object is, classes 1, 2 or 3, pc will be equal to 1. And if it's the background class, so if it's none of the objects you're trying to detect, then pc will be 0. And pc you can think of that as standing for the probability that there's an object. Probability that one of the classes you're trying to detect is there. So something other than the background class. Next if there is an object, then you wanted to output bx, by, bh and bw, the bounding box for the object you detected. And finally if there is an object, so if pc is equal to 1, you wanted to also output c1, c2 and c3 which tells us is it the class 1, class 2 or class 3. So is it a pedestrian, a car or a motorcycle. And remember in the problem we're addressing we assume that your image has only one object. So at most, one of these objects appears in the picture, in this classification with localization problem. So let's go through a couple of examples. If this is a training set image, so if that is x, then y will be the first component pc will be equal to 1 because there is an object, then bx, by, by, bh and bw will specify the bounding box. So your labeled training set will need bounding boxes in the labels. And then finally this is a car, so it's class 2. So c1 will be 0 because it's not a pedestrian, c2 will be 1 because it is car, c3 will be 0 since it is not a motorcycle. So among c1, c2 and c3 at most one of them should be equal to 1. So that's if there's an object in the image. What if there's no object in the image? What if we have a training example where x is equal to that? In this case, pc would be equal to 0, and the rest of the elements of this, will be don't cares, so I'm going to write question marks in all of them. So this is a don't care, because if there is no object in this image, then you don't care what bounding box the neural network outputs as well as which of the three objects, c1, c2, c3 it thinks it is. So given a set of label training examples, this is how you will construct x, the input image as well as y, the cost label both for images where there is an object and for images where there is no object. And the set of this will then define your training set.
8:47
Finally, next let's describe the loss function you use to train the neural network. So the ground true label was y and the neural network outputs some yhat. What should be the loss be? Well if you're using squared error then the loss can be (y1 hat- y1) squared + (y2 hat- y2) squared + ...+( y8 hat- y8) squared. Notice that y here has eight components. So that goes from sum of the squares of the difference of the elements. And that's the loss if y1=1. So that's the case where there is an object. So y1= pc. So, pc = 1, that if there is an object in the image then the loss can be the sum of squares of all the different elements.
9:48
The other case is if y1=0, so that's if this pc = 0. In that case the loss can be just (y1 hat-y1) squared, because in that second case, all of the rest of the components are don't care us. And so all you care about is how accurately is the neural network ourputting pc in that case. So just a recap, if y1 = 1, that's this case, then you can use squared error to penalize square deviation from the predicted, and the actual output of all eight components. Whereas if y1 = 0, then the second to the eighth components I don't care. So all you care about is how accurately is your neural network estimating y1, which is equal to pc. Just as a side comment for those of you that want to know all the details, I've used the squared error just to simplify the description here. In practice you could probably use a log like feature loss for the c1, c2, c3 to the softmax output. One of those elements usually you can use squared error or something like squared error for the bounding box coordinates and if a pc you could use something like the logistics regression loss. Although even if you use squared error it'll probably work okay. So that's how you get a neural network to not just classify an object but also to localize it. The idea of having a neural network output a bunch of real numbers to tell you where things are in a picture turns out to be a very powerful idea. In the next video I want to share with you some other places where this idea of having a neural network output a set of real numbers, almost as a regression task, can be very powerful to use elsewhere in computer vision as well. So let's go on to the next video.

Landmark Detection - 5m

0:00
In the previous video, you saw how you can get a neural network to output four numbers of bx, by, bh, and bw to specify the bounding box of an object you want a neural network to localize. In more general cases, you can have a neural network just output X and Y coordinates of important points and image, sometimes called landmarks, that you want the neural networks to recognize. Let me show you a few examples. Let's say you're building a face recognition application and for some reason, you want the algorithm to tell you where is the corner of someone's eye. So that point has an X and Y coordinate, so you can just have a neural network have its final layer and have it just output two more numbers which I'm going to call our lx and ly to just tell you the coordinates of that corner of the person's eye. Now, what if you want it to tell you all four corners of the eye, really of both eyes. So, if we call the points, the first, second, third and fourth points going from left to right, then you could modify the neural network now to output l1x, l1y for the first point and l2x, l2y for the second point and so on, so that the neural network can output the estimated position of all those four points of the person's face. But what if you don't want just those four points? What do you want to output this point, and this point and this point and this point along the eye? Maybe I'll put some key points along the mouth, so you can extract the mouth shape and tell if the person is smiling or frowning, maybe extract a few key points along the edges of the nose but you could define some number, for the sake of argument, let's say 64 points or 64 landmarks on the face. Maybe even some points that help you define the edge of the face, defines the jaw line but by selecting a number of landmarks and generating a label training sets that contains all of these landmarks, you can then have the neural network to tell you where are all the key positions or the key landmarks on a face. So what you do is you have this image, a person's face as input, have it go through a convnet and have a convnet, then have some set of features, maybe have it output 0 or 1, like zero face changes or not and then have it also output l1x, l1y and so on down to l64x, l64y. And here I'm using l to stand for a landmark. So this example would have 129 output units, one for is your face or not? And then if you have 64 landmarks, that's sixty-four times two, so 128 plus one output units and this can tell you if there's a face as well as where all the key landmarks on the face. So, this is a basic building block for recognizing emotions from faces and if you played with the Snapchat and the other entertainment, also AR augmented reality filters like the Snapchat photos can draw a crown on the face and have other special effects. Being able to detect these landmarks on the face, there's also a key building block for the computer graphics effects that warp the face or drawing various special effects like putting a crown or a hat on the person. Of course, in order to treat a network like this, you will need a label training set. We have a set of images as well as labels Y where people, where someone will have had to go through and laboriously annotate all of these landmarks. One last example, if you are interested in people pose detection, you could also define a few key positions like the midpoint of the chest, the left shoulder, left elbow, the wrist, and so on, and just have a neural network to annotate key positions in the person's pose as well and by having a neural network output, all of those points I'm annotating, you could also have the neural network output the pose of the person. And of course, to do that you also need to specify on these key landmarks like maybe l1x and l1y is the midpoint of the chest down to maybe l32x, l32y, if you use 32 coordinates to specify the pose of the person. So, this idea might seem quite simple of just adding a bunch of output units to output the X,Y coordinates of different landmarks you want to recognize. To be clear, the identity of landmark one must be consistent across different images like maybe landmark one is always this corner of the eye, landmark two is always this corner of the eye, landmark three, landmark four, and so on. So, the labels have to be consistent across different images. But if you can hire labelers or label yourself a big enough data set to do this, then a neural network can output all of these landmarks which is going to used to carry out other interesting effect such as with the pose of the person, maybe try to recognize someone's emotion from a picture, and so on. So that's it for landmark detection. Next, let's take these building blocks and use it to start building up towards object detection.

Object Detection - 5m

0:00
You've learned about Object Localization as well as Landmark Detection. Now, let's build up to other object detection algorithm. In this video, you'll learn how to use a cofinite to perform object detection using something called the Sliding Windows Detection Algorithm. Let's say you want to build a car detection algorithm. Here's what you can do. You can first create a label training set, so x and y with closely cropped examples of cars. So, this is image x has a positive example, there's a car, here's a car, here's a car, and then there's not a car, there's not a car. And for our purposes in this training set, you can start off with the one with the car closely cropped images. Meaning that x is pretty much only the car. So, you can take a picture and crop out and just cut out anything else that's not part of a car. So you end up with the car centered in pretty much the entire image. Given this label training set, you can then train a cofinite that inputs an image, like one of these closely cropped images. And then the job of the cofinite is to output y, zero or one, is there a car or not. Once you've trained up this cofinite, you can then use it in Sliding Windows Detection. So the way you do that is, if you have a test image like this what you do is you start by picking a certain window size, shown down there. And then you would input into this cofinite a small rectangular region. So, take just this below red square, input that into the cofinite, and have a cofinite make a prediction. And presumably for that little region in the red square, it'll say, no that little red square does not contain a car. In the Sliding Windows Detection Algorithm, what you do is you then pass as input a second image now bounded by this red square shifted a little bit over and feed that to the cofinite. So, you're feeding just the region of the image in the red squares of the cofinite and run the cofinite again. And then you do that with a third image and so on. And you keep going until you've slid the window across every position in the image. And I'm using a pretty large stride in this example just to make the animation go faster. But the idea is you basically go through every region of this size, and pass lots of little cropped images into the cofinite and have it classified zero or one for each position as some stride. Now, having done this once with running this was called the sliding window through the image. You then repeat it, but now use a larger window. So, now you take a slightly larger region and run that region. So, resize this region into whatever input size the cofinite is expecting, and feed that to the cofinite and have it output zero or one. And then slide the window over again using some stride and so on. And you run that throughout your entire image until you get to the end. And then you might do the third time using even larger windows and so on. Right. And the hope is that if you do this, then so long as there's a car somewhere in the image that there will be a window where, for example if you are passing in this window into the cofinite, hopefully the cofinite will have outputs one for that input region. So then you detect that there is a car there. So this algorithm is called Sliding Windows Detection because you take these windows, these square boxes, and slide them across the entire image and classify every square region with some stride as containing a car or not. Now there's a huge disadvantage of Sliding Windows Detection, which is the computational cost. Because you're cropping out so many different square regions in the image and running each of them independently through a cofinite. And if you use a very coarse stride, a very big stride, a very big step size, then that will reduce the number of windows you need to pass through the cofinite, but that courser granularity may hurt performance. Whereas if you use a very fine granularity or a very small stride, then the huge number of all these little regions you're passing through the cofinite means that means there is a very high computational cost. So, before the rise of Neural Networks people used to use much simpler classifiers like a simple linear classifier over hand engineer features in order to perform object detection. And in that era because each classifier was relatively cheap to compute, it was just a linear function, Sliding Windows Detection ran okay. It was not a bad method, but with cofinite now running a single classification task is much more expensive and sliding windows this way is infeasibily slow. And unless you use a very fine granularity or a very small stride, you end up not able to localize the objects that accurately within the image as well. Fortunately however, this problem of computational cost has a pretty good solution. In particular, the Sliding Windows Object Detector can be implemented convolutionally or much more efficiently. Let's see in the next video how you can do that.

Convolutional Implementation of Sliding Windows - 11m

0:00
In the last video, you learned about the sliding windows object detection algorithm using a convnet but we saw that it was too slow. In this video, you'll learn how to implement that algorithm convolutionally. Let's see what this means. To build up towards the convolutional implementation of sliding windows let's first see how you can turn fully connected layers in neural network into convolutional layers. We'll do that first on this slide and then the next slide, we'll use the ideas from this slide to show you the convolutional implementation. So let's say that your object detection algorithm inputs 14 by 14 by 3 images. This is quite small but just for illustrative purposes, and let's say it then uses 5 by 5 filters, and let's say it uses 16 of them to map it from 14 by 14 by 3 to 10 by 10 by 16. And then does a 2 by 2 max pooling to reduce it to 5 by 5 by 16. Then has a fully connected layer to connect to 400 units. Then now they're fully connected layer and then finally outputs a Y using a softmax unit. In order to make the change we'll need to in a second, I'm going to change this picture a little bit and instead I'm going to view Y as four numbers, corresponding to the cause probabilities of the four causes that softmax units is classified amongst. And the full causes could be pedestrian, car, motorcycle, and background or something else. Now, what I'd like to do is show how these layers can be turned into convolutional layers. So, the convnet will draw same as before for the first few layers. And now, one way of implementing this next layer, this fully connected layer is to implement this as a 5 by 5 filter and let's use 400 5 by 5 filters. So if you take a 5 by 5 by 16 image and convolve it with a 5 by 5 filter, remember, a 5 by 5 filter is implemented as 5 by 5 by 16 because our convention is that the filter looks across all 16 channels. So this 16 and this 16 must match and so the outputs will be 1 by 1. And if you have 400 of these 5 by 5 by 16 filters, then the output dimension is going to be 1 by 1 by 400. So rather than viewing these 400 as just a set of nodes, we're going to view this as a 1 by 1 by 400 volume. Mathematically, this is the same as a fully connected layer because each of these 400 nodes has a filter of dimension 5 by 5 by 16. So each of those 400 values is some arbitrary linear function of these 5 by 5 by 16 activations from the previous layer. Next, to implement the next convolutional layer, we're going to implement a 1 by 1 convolution. If you have 400 1 by 1 filters then, with 400 filters the next layer will again be 1 by 1 by 400. So that gives you this next fully connected layer. And then finally, we're going to have another 1 by 1 filter, followed by a softmax activation. So as to give a 1 by 1 by 4 volume to take the place of these four numbers that the network was operating. So this shows how you can take these fully connected layers and implement them using convolutional layers so that these sets of units instead are not implemented as 1 by 1 by 400 and 1 by 1 by 4 volumes. After this conversion, let's see how you can have a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Robert Fergus and Yann Lecun. Let's say that your sliding windows convnet inputs 14 by 14 by 3 images and again, I'm just using small numbers like the 14 by 14 image in this slide mainly to make the numbers and illustrations simpler. So as before, you have a neural network as follows that eventually outputs a 1 by 1 by 4 volume, which is the output of your softmax. Again, to simplify the drawing here, 14 by 14 by 3 is technically a volume 5 by 5 or 10 by 10 by 16, the second clear volume. But to simplify the drawing for this slide, I'm just going to draw the front face of this volume. So instead of drawing 1 by 1 by 400 volume, I'm just going to draw the 1 by 1 cause of all of these. So just dropped the three components of these drawings, just for this slide. So let's say that your convnet inputs 14 by 14 images or 14 by 14 by 3 images and your tested image is 16 by 16 by 3. So now added that yellow stripe to the border of this image. In the original sliding windows algorithm, you might want to input the blue region into a convnet and run that once to generate a consecration 01 and then slightly down a bit, least he uses a stride of two pixels and then you might slide that to the right by two pixels to input this green rectangle into the convnet and we run the whole convnet and get another label, 01. Then you might input this orange region into the convnet and run it one more time to get another label. And then do it the fourth and final time with this lower right purple square. To run sliding windows on this 16 by 16 by 3 image is pretty small image. You run this convnet four times in order to get four labels. But it turns out a lot of this computation done by these four convnets is highly duplicative. So what the convolutional implementation of sliding windows does is it allows these four pauses in the convnet to share a lot of computation. Specifically, here's what you can do. You can take the convnet and just run it same parameters, the same 5 by 5 filters, also 16 5 by 5 filters and run it. Now, you can have a 12 by 12 by 16 output volume. Then do the max pool, same as before. Now you have a 6 by 6 by 16, runs through your same 400 5 by 5 filters to get now your 2 by 2 by 40 volume. So now instead of a 1 by 1 by 400 volume, we have instead a 2 by 2 by 400 volume. Run it through a 1 by 1 filter gives you another 2 by 2 by 400 instead of 1 by 1 like 400. Do that one more time and now you're left with a 2 by 2 by 4 output volume instead of 1 by 1 by 4. It turns out that this blue 1 by 1 by 4 subset gives you the result of running in the upper left hand corner 14 by 14 image. This upper right 1 by 1 by 4 volume gives you the upper right result. The lower left gives you the results of implementing the convnet on the lower left 14 by 14 region. And the lower right 1 by 1 by 4 volume gives you the same result as running the convnet on the lower right 14 by 14 medium. And if you step through all the steps of the calculation, let's look at the green example, if you had cropped out just this region and passed it through the convnet through the convnet on top, then the first layer's activations would have been exactly this region. The next layer's activation after max pooling would have been exactly this region and then the next layer, the next layer would have been as follows. So what this process does, what this convolution implementation does is, instead of forcing you to run four propagation on four subsets of the input image independently, Instead, it combines all four into one form of computation and shares a lot of the computation in the regions of image that are common. So all four of the 14 by 14 patches we saw here. Now let's just go through a bigger example. Let's say you now want to run sliding windows on a 28 by 28 by 3 image. It turns out If you run four from the same way then you end up with an 8 by 8 by 4 output. And just go small and surviving sliding windows with that 14 by 14 region. And that corresponds to running a sliding windows first on that region thus, giving you the output corresponding the upper left hand corner. Then using a slider too to shift one window over, one window over, one window over and so on and the eight positions. So that gives you this first row and then as you go down the image as well, that gives you all of these 8 by 8 by 4 outputs. Because of the max pooling up too that this corresponds to running your neural network with a stride of two on the original image. So just to recap, to implement sliding windows, previously, what you do is you crop out a region. Let's say this is 14 by 14 and run that through your convnet and do that for the next region over, then do that for the next 14 by 14 region, then the next one, then the next one, then the next one, then the next one and so on, until hopefully that one recognizes the car. But now, instead of doing it sequentially, with this convolutional implementation that you saw in the previous slide, you can implement the entire image, all maybe 28 by 28 and convolutionally make all the predictions at the same time by one forward pass through this big convnet and hopefully have it recognize the position of the car. So that's how you implement sliding windows convolutionally and it makes the whole thing much more efficient. Now, this [inaudible] still has one weakness, which is the position of the bounding boxes is not going to be too accurate. In the next video, let's see how you can fix that problem.

Bounding Box Predictions - 14m

0:00
In the last video, you learned how to use a convolutional implementation of sliding windows. That's more computationally efficient, but it still has a problem of not quite outputting the most accurate bounding boxes. In this video, let's see how you can get your bounding box predictions to be more accurate. With sliding windows, you take this three sets of locations and run the crossfire through it. And in this case, none of the boxes really match up perfectly with the position of the car. So, maybe that box is the best match. And also, it looks like in drawn through, the perfect bounding box isn't even quite square, it's actually has a slightly wider rectangle or slightly horizontal aspect ratio. So, is there a way to get this algorithm to outputs more accurate bounding boxes? A good way to get this output more accurate bounding boxes is with the YOLO algorithm. YOLO stands for, You Only Look Once. And is an algorithm due to Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi. Here's what you do. Let's say you have an input image at 100 by 100, you're going to place down a grid on this image. And for the purposes of illustration, I'm going to use a 3 by 3 grid. Although in an actual implementation, you use a finer one, like maybe a 19 by 19 grid. And the basic idea is you're going to take the image classification and localization algorithm that you saw a few videos back, and apply it to each of the nine grids. And the basic idea is you're going to take the image classification and localization algorithm that you saw in the first video of this week and apply that to each of the nine grid cells of this image. So the more concrete, here's how you define the labels you use for training. So for each of the nine grid cells, you specify a label Y, where the label Y is this eight dimensional vector, same as you saw previously. Your first output PC 01 depending on whether or not there's an image in that grid cell and then BX, BY, BH, BW to specify the bounding box if there is an image, if there is an object associated with that grid cell. And then say, C1, C2, C3, if you try and recognize three classes not counting the background class. So you try to recognize pedestrian's class, motorcycles and the background class. Then C1 C2 C3 can be the pedestrian, car and motorcycle classes. So in this image, we have nine grid cells, so you have a vector like this for each of the grid cells. So let's start with the upper left grid cell, this one up here. For that one, there is no object. So, the label vector Y for the upper left grid cell would be zero, and then don't cares for the rest of these. The output label Y would be the same for this grid cell, and this grid cell, and all the grid cells with nothing, with no interesting object in them. Now, how about this grid cell? To give a bit more detail, this image has two objects. And what the YOLO algorithm does is it takes the midpoint of reach of the two objects and then assigns the object to the grid cell containing the midpoint. So the left car is assigned to this grid cell, and the car on the right, which is this midpoint, is assigned to this grid cell. And so even though the central grid cell has some parts of both cars, we'll pretend the central grid cell has no interesting object so that the central grid cell the class label Y also looks like this vector with no object, and so the first component PC, and then the rest are don't cares. Whereas for this cell, this cell that I have circled in green on the left, the target label Y would be as follows. There is an object, and then you write BX, BY, BH, BW, to specify the position of this bounding box. And then you have, let's see, if class one was a pedestrian, then that was zero. Class two is a car, that's one. Class three was a motorcycle, that's zero. And then similarly, for the grid cell on their right because that does have an object in it, it will also have some vector like this as the target label corresponding to the grid cell on the right. So, for each of these nine grid cells, you end up with a eight dimensional output vector. And because you have 3 by 3 grid cells, you have nine grid cells, the total volume of the output is going to be 3 by 3 by 8. So the target output is going to be 3 by 3 by 8 because you have 3 by 3 grid cells. And for each of the 3 by 3 grid cells, you have a eight dimensional Y vector. So the target output volume is 3 by 3 by 8. Where for example, this 1 by 1 by 8 volume in the upper left corresponds to the target output vector for the upper left of the nine grid cells. And so for each of the 3 by 3 positions, for each of these nine grid cells, does it correspond in eight dimensional target vector Y that you want to the output. Some of which could be don't cares, if there's no object there. And that's why the total target outputs, the output label for this image is now itself a 3 by 3 by 8 volume. So now, to train your neural network, the input is 100 by 100 by 3, that's the input image. And then you have a usual convnet with conv, layers of max pool layers, and so on. So that in the end, you have this, should choose the conv layers and the max pool layers, and so on, so that this eventually maps to a 3 by 3 by 8 output volume. And so what you do is you have an input X which is the input image like that, and you have these target labels Y which are 3 by 3 by 8, and you use map propagation to train the neural network to map from any input X to this type of output volume Y. So the advantage of this algorithm is that the neural network outputs precise bounding boxes as follows. So at test time, what you do is you feed an input image X and run forward prop until you get this output Y. And then for each of the nine outputs of each of the 3 by 3 positions in which of the output, you can then just read off 1 or 0. Is there an object associated with that one of the nine positions? And that there is an object, what object it is, and where is the bounding box for the object in that grid cell? And so long as you don't have more than one object in each grid cell, this algorithm should work okay. And the problem of having multiple objects within the grid cell is something we'll address later. Of use a relatively small 3 by 3 grid, in practice, you might use a much finer, grid maybe 19 by 19. So you end up with 19 by 19 by 8, and that also makes your grid much finer. It reduces the chance that there are multiple objects assigned to the same grid cell. And just as a reminder, the way you assign an object to grid cell as you look at the midpoint of an object and then you assign that object to whichever one grid cell contains the midpoint of the object. So each object, even if the objects spends multiple grid cells, that object is assigned only to one of the nine grid cells, or one of the 3 by 3, or one of the 19 by 19 grid cells. Algorithm of a 19 by 19 grid, the chance of an object of two midpoints of objects appearing in the same grid cell is just a bit smaller. So notice two things, first, this is a lot like the image classification and localization algorithm that we talked about in the first video of this week. And that it outputs the bounding balls coordinates explicitly. And so this allows in your network to output bounding boxes of any aspect ratio, as well as, output much more precise coordinates that aren't just dictated by the stripe size of your sliding windows classifier. And second, this is a convolutional implementation and you're not implementing this algorithm nine times on the 3 by 3 grid or if you're using a 19 by 19 grid.19 squared is 361. So, you're not running the same algorithm 361 times or 19 squared times. Instead, this is one single convolutional implantation, where you use one consonant with a lot of shared computation between all the computations needed for all of your 3 by 3 or all of your 19 by 19 grid cells. So, this is a pretty efficient algorithm. And in fact, one nice thing about the YOLO algorithm, which is constant popularity is because this is a convolutional implementation, it actually runs very fast. So this works even for real time object detection. Now, before wrapping up, there's one more detail I want to share with you, which is, how do you encode these bounding boxes bx, by, BH, BW? Let's discuss that on the next slide. So, given these two cars, remember, we have the 3 by 3 grid. Let's take the example of the car on the right. So, in this grid cell there is an object and so the target label y will be one, that was PC is equal to one. And then bx, by, BH, BW, and then 0 1 0. So, how do you specify the bounding box? In the YOLO algorithm, relative to this square, when I take the convention that the upper left point here is 0 0 and this lower right point is 1 1. So to specify the position of that midpoint, that orange dot, bx might be, let's say x looks like is about 0.4. Maybe its about 0.4 of the way to their right. And then y, looks I guess maybe 0.3. And then the height of the bounding box is specified as a fraction of the overall width of this box. So, the width of this red box is maybe 90% of that blue line. And so BH is 0.9 and the height of this is maybe one half of the overall height of the grid cell. So in that case, BW would be, let's say 0.5. So, in other words, this bx, by, BH, BW as specified relative to the grid cell. And so bx and by, this has to be between 0 and 1, right? Because pretty much by definition that orange dot is within the bounds of that grid cell is assigned to. If it wasn't between 0 and 1 it was outside the square, then we'll have been assigned to a different grid cell. But these could be greater than one. In particular if you have a car where the bounding box was that, then the height and width of the bounding box, this could be greater than one. So, there are multiple ways of specifying the bounding boxes, but this would be one convention that's quite reasonable. Although, if you read the YOLO research papers, the YOLO research line there were other parameterizations that work even a little bit better, but I hope this gives one reasonable condition that should work okay. Although, there are some more complicated parameterizations involving sigmoid functions to make sure this is between 0 and 1. And using an explanation parameterization to make sure that these are non-negative, since 0.9, 0.5, this has to be greater or equal to zero. There are some other more advanced parameterizations that work things a little bit better, but the one you saw here should work okay. So, that's it for the YOLO or the You Only Look Once algorithm. And in the next few videos I'll show you a few other ideas that will help make this algorithm even better. In the meantime, if you want, you can take a look at YOLO paper reference at the bottom of these past couple slides I use. Although, just one warning, if you take a look at these papers which is the YOLO paper is one of the harder papers to read. I remember, when I was reading this paper for the first time, I had a really hard time figuring out what was going on. And I wound up asking a couple of my friends, very good researchers to help me figure it out, and even they had a hard time understanding some of the details of the paper. So, if you look at the paper, it's okay if you have a hard time figuring it out. I wish it was more uncommon, but it's not that uncommon, sadly, for even senior researchers, that review research papers and have a hard time figuring out the details. And have to look at open source code, or contact the authors, or something else to figure out the details of these outcomes. But don't let me stop you from taking a look at the paper yourself though if you wish, but this is one of the harder ones. So, that though, you now understand the basics of the YOLO algorithm. Let's go on to some additional pieces that will make this algorithm work even better.

Intersection Over Union - 4m

0:00
So how do you tell if your object detection algorithm is working well? In this video, you'll learn about a function called, "Intersection Over Union". And as we use both for evaluating your object detection algorithm, as well as in the next video, using it to add another component to your object detection algorithm, to make it work even better. Let's get started. In the object detection task, you expected to localize the object as well. So if that's the ground-truth bounding box, and if your algorithm outputs this bounding box in purple, is this a good outcome or a bad one? So what the intersection over union function does, or IoU does, is it computes the intersection over union of these two bounding boxes. So, the union of these two bounding boxes is this area, is really the area that is contained in either bounding boxes, whereas the intersection is this smaller region here. So what the intersection of a union does is it computes the size of the intersection. So that orange shaded area, and divided by the size of the union, which is that green shaded area. And by convention, the low compute division task will judge that your answer is correct if the IoU is greater than 0.5. And if the predicted and the ground-truth bounding boxes overlapped perfectly, the IoU would be one, because the intersection would equal to the union. But in general, so long as the IoU is greater than or equal to 0.5, then the answer will look okay, look pretty decent. And by convention, very often 0.5 is used as a threshold to judge as whether the predicted bounding box is correct or not. This is just a convention. If you want to be more stringent, you can judge an answer as correct, only if the IoU is greater than equal to 0.6 or some other number. But the higher the IoUs, the more accurate the bounding the box. And so, this is one way to map localization, to accuracy where you just count up the number of times an algorithm correctly detects and localizes an object where you could use a definition like this, of whether or not the object is correctly localized. And again 0.5 is just a human chosen convention. There's no particularly deep theoretical reason for it. You can also choose some other threshold like 0.6 if you want to be more stringent. I sometimes see people use more stringent criteria like 0.6 or maybe 0.7. I rarely see people drop the threshold below 0.5. Now, what motivates the definition of IoU, as a way to evaluate whether or not your object localization algorithm is accurate or not. But more generally, IoU is a measure of the overlap between two bounding boxes. Where if you have two boxes, you can compute the intersection, compute the union, and take the ratio of the two areas. And so this is also a way of measuring how similar two boxes are to each other. And we'll see this use again this way in the next video when we talk about non-max suppression. So that's it for IoU or Intersection over Union. Not to be confused with the promissory note concept in IoU, where if you lend someone money they write you a note that says, " Oh I owe you this much money," so that's also called an IoU. It's totally a different concept, that maybe it's cool that these two things have a similar name. So now, onto this definition of IoU, Intersection of Union. In the next video, I want to discuss with you non-max suppression, which is a tool you can use to make the outputs of YOLO work even better. So let's go on to the next video.

Non-max Suppression - 8m

0:00
One of the problems of Object Detection as you've learned about this so far, is that your algorithm may find multiple detections of the same objects. Rather than detecting an object just once, it might detect it multiple times. Non-max suppression is a way for you to make sure that your algorithm detects each object only once. Let's go through an example. Let's say you want to detect pedestrians, cars, and motorcycles in this image. You might place a grid over this, and this is a 19 by 19 grid. Now, while technically this car has just one midpoint, so it should be assigned just one grid cell. And the car on the left also has just one midpoint, so technically only one of those grid cells should predict that there is a car. In practice, you're running an object classification and localization algorithm for every one of these split cells. So it's quite possible that this split cell might think that the center of a car is in it, and so might this, and so might this, and for the car on the left as well. Maybe not only this box, if this is a test image you've seen before, not only that box might decide things that's on the car, maybe this box, and this box and maybe others as well will also think that they've found the car. Let's step through an example of how non-max suppression will work. So, because you're running the image classification and localization algorithm on every grid cell, on 361 grid cells, it's possible that many of them will raise their hand and say, "My Pc, my chance of thinking I have an object in it is large." Rather than just having two of the grid cells out of the 19 squared or 361 think they have detected an object. So, when you run your algorithm, you might end up with multiple detections of each object. So, what non-max suppression does, is it cleans up these detections. So they end up with just one detection per car, rather than multiple detections per car. So concretely, what it does, is it first looks at the probabilities associated with each of these detections. Canada Pcs, although there are some details you'll learn about in this week's problem exercises, is actually Pc times C1, or C2, or C3. But for now, let's just say is Pc with the probability of a detection. And it first takes the largest one, which in this case is 0.9 and says, "That's my most confident detection, so let's highlight that and just say I found the car there." Having done that the non-max suppression part then looks at all of the remaining rectangles and all the ones with a high overlap, with a high IOU, with this one that you've just output will get suppressed. So those two rectangles with the 0.6 and the 0.7. Both of those overlap a lot with the light blue rectangle. So those, you are going to suppress and darken them to show that they are being suppressed. Next, you then go through the remaining rectangles and find the one with the highest probability, the highest Pc, which in this case is this one with 0.8. So let's commit to that and just say, "Oh, I've detected a car there." And then, the non-max suppression part is to then get rid of any other ones with a high IOU. So now, every rectangle has been either highlighted or darkened. And if you just get rid of the darkened rectangles, you are left with just the highlighted ones, and these are your two final predictions. So, this is non-max suppression. And non-max means that you're going to output your maximal probabilities classifications but suppress the close-by ones that are non-maximal. Hence the name, non-max suppression. Let's go through the details of the algorithm. First, on this 19 by 19 grid, you're going to get a 19 by 19 by eight output volume. Although, for this example, I'm going to simplify it to say that you only doing car detection. So, let me get rid of the C1, C2, C3, and pretend for this line, that each output for each of the 19 by 19, so for each of the 361, which is 19 squared, for each of the 361 positions, you get an output prediction of the following. Which is the chance there's an object, and then the bounding box. And if you have only one object, there's no C1, C2, C3 prediction. The details of what happens, you have multiple objects, I'll leave to the programming exercise, which you'll work on towards the end of this week. Now, to intimate non-max suppression, the first thing you can do is discard all the boxes, discard all the predictions of the bounding boxes with Pc less than or equal to some threshold, let's say 0.6. So we're going to say that unless you think there's at least a 0.6 chance it is an object there, let's just get rid of it. This has caused all of the low probability output boxes. The way to think about this is for each of the 361 positions, you output a bounding box together with a probability of that bounding box being a good one. So we're just going to discard all the bounding boxes that were assigned a low probability. Next, while there are any remaining bounding boxes that you've not yet discarded or processed, you're going to repeatedly pick the box with the highest probability, with the highest Pc, and then output that as a prediction. So this is a process on a previous slide of taking one of the bounding boxes, and making it lighter in color. So you commit to outputting that as a prediction for that there is a car there. Next, you then discard any remaining box. Any box that you have not output as a prediction, and that was not previously discarded. So discard any remaining box with a high overlap, with a high IOU, with the box that you just output in the previous step. This second step in the while loop was when on the previous slide you would darken any remaining bounding box that had a high overlap with the bounding box that we just made lighter, that we just highlighted. And so, you keep doing this while there's still any remaining boxes that you've not yet processed, until you've taken each of the boxes and either output it as a prediction, or discarded it as having too high an overlap, or too high an IOU, with one of the boxes that you have just output as your predicted position for one of the detected objects.
7:00
I've described the algorithm using just a single object on this slide. If you actually tried to detect three objects say pedestrians, cars, and motorcycles, then the output vector will have three additional components. And it turns out, the right thing to do is to independently carry out non-max suppression three times, one on each of the outputs classes. But the details of that, I'll leave to this week's program exercise where you get to implement that yourself, where you get to implement non-max suppression yourself on multiple object classes. So that's it for non-max suppression, and if you implement the Object Detection algorithm we've described, you actually get pretty decent results. But before wrapping up our discussion of the YOLO algorithm, there's just one last idea I want to share with you, which makes the algorithm work much better, which is the idea of using anchor boxes. Let's go on to the next video.

Anchor Boxes - 9m

0:00
One of the problems with object detection as you have seen it so far is that each of the grid cells can detect only one object. What if a grid cell wants to detect multiple objects? Here is what you can do. You can use the idea of anchor boxes. Let's start with an example. Let's say you have an image like this. And for this example, I am going to continue to use a 3 by 3 grid. Notice that the midpoint of the pedestrian and the midpoint of the car are in almost the same place and both of them fall into the same grid cell. So, for that grid cell, if Y outputs this vector where you are detecting three causes, pedestrians, cars and motorcycles, it won't be able to output two detections. So I have to pick one of the two detections to output. With the idea of anchor boxes, what you are going to do, is pre-define two different shapes called, anchor boxes or anchor box shapes. And what you are going to do is now, be able to associate two predictions with the two anchor boxes. And in general, you might use more anchor boxes, maybe five or even more. But for this video, I am just going to use two anchor boxes just to make the description easier. So what you do is you define the cross label to be, instead of this vector on the left, you basically repeat this twice. S, you will have PC, PX, PY, PH, PW, C1, C2, C3, and these are the eight outputs associated with anchor box 1. And then you repeat that PC, PX and so on down to C1, C2, C3, and other eight outputs associated with anchor box 2. So, because the shape of the pedestrian is more similar to the shape of anchor box 1 and anchor box 2, you can use these eight numbers to encode that PC as one, yes there is a pedestrian. Use this to encode the bounding box around the pedestrian, and then use this to encode that that object is a pedestrian. And then because the box around the car is more similar to the shape of anchor box 2 than anchor box 1, you can then use this to encode that the second object here is the car, and have the bounding box and so on be all the parameters associated with the detected car. So to summarize, previously, before you are using anchor boxes, you did the following, which is for each object in the training set and the training set image, it was assigned to the grid cell that corresponds to that object's midpoint. And so the output Y was 3 by 3 by 8 because you have a 3 by 3 grid. And for each grid position, we had that output vector which is PC, then the bounding box, and C1, C2, C3. With the anchor box, you now do that following. Now, each object is assigned to the same grid cell as before, assigned to the grid cell that contains the object's midpoint, but it is assigned to a grid cell and anchor box with the highest IoU with the object's shape. So, you have two anchor boxes, you will take an object and see. So if you have an object with this shape, what you do is take your two anchor boxes. Maybe one anchor box is this this shape that's anchor box 1, maybe anchor box 2 is this shape, and then you see which of the two anchor boxes has a higher IoU, will be drawn through bounding box. And whichever it is, that object then gets assigned not just to a grid cell but to a pair. It gets assigned to grid cell comma anchor box pair. And that's how that object gets encoded in the target label. And so now, the output Y is going to be 3 by 3 by 16. Because as you saw on the previous slide, Y is now 16 dimensional. Or if you want, you can also view this as 3 by 3 by 2 by 8 because there are now two anchor boxes and Y is eight dimensional. And dimension of Y being eight was because we have three objects causes if you have more objects than the dimension of Y would be even higher. So let's go through a complete example. For this grid cell, let's specify what is Y. So the pedestrian is more similar to the shape of anchor box 1. So for the pedestrian, we're going to assign it to the top half of this vector. So yes, there is an object, there will be some bounding box associated at the pedestrian. And I guess if a pedestrian is cos one, then we see one as one, and then zero, zero. And then the shape of the car is more similar to anchor box 2. And so the rest of this vector will be one and then the bounding box associated with the car, and then the car is C2, so there's zero, one, zero. And so that's the label Y for that lower middle grid cell that this arrow was pointing to. Now, what if this grid cell only had a car and had no pedestrian? If it only had a car, then assuming that the shape of the bounding box around the car is still more similar to anchor box 2, then the target label Y, if there was just a car there and the pedestrian had gone away, it will still be the same for the anchor box 2 component. Remember that this is a part of the vector corresponding to anchor box 2. And for the part of the vector corresponding to anchor box 1, what you do is you just say there is no object there. So PC is zero, and then the rest of these will be don't cares. Now, just some additional details. What if you have two anchor boxes but three objects in the same grid cell? That's one case that this algorithm doesn't handle well. Hopefully, it won't happen. But if it does, this algorithm doesn't have a great way of handling it. I will just influence some default tiebreaker for that case. Or what if you have two objects associated with the same grid cell, but both of them have the same anchor box shape? Again, that's another case that this algorithm doesn't handle well. If you influence some default way of tiebreaking if that happens, hopefully this won't happen with your data set, it won't happen much at all. And so, it shouldn't affect performance as much. So, that's it for anchor boxes. And even though I'd motivated anchor boxes as a way to deal with what happens if two objects appear in the same grid cell, in practice, that happens quite rarely, especially if you use a 19 by 19 rather than a 3 by 3 grid. The chance of two objects having the same midpoint rather these 361 cells, it does happen, but it doesn't happen that often. Maybe even better motivation or even better results that anchor boxes gives you is it allows your learning algorithm to specialize better. In particular, if your data set has some tall, skinny objects like pedestrians, and some white objects like cars, then this allows your learning algorithm to specialize so that some of the outputs can specialize in detecting white, fat objects like cars, and some of the output units can specialize in detecting tall, skinny objects like pedestrians. So finally, how do you choose the anchor boxes? And people used to just choose them by hand or choose maybe five or 10 anchor box shapes that spans a variety of shapes that seems to cover the types of objects you seem to detect. As a much more advanced version, just in the advance common for those of who have other knowledge in machine learning, and even better way to do this in one of the later YOLO research papers, is to use a K-means algorithm, to group together two types of objects shapes you tend to get. And then to use that to select a set of anchor boxes that this most stereotypically representative of the maybe multiple, of the maybe dozens of object causes you're trying to detect. But that's a more advanced way to automatically choose the anchor boxes. And if you just choose by hand a variety of shapes that reasonably expands the set of object shapes, you expect to detect some tall, skinny ones, some fat, white ones. That should work with these as well. So that's it for anchor boxes. In the next video, let's take everything we've seen and tie it back together into the YOLO algorithm.

YOLO Algorithm - 7m

0:00
You've already seen most of the components of object detection. In this video, let's put all the components together to form the YOLO object detection algorithm. First, let's see how you construct your training set. Suppose you're trying to train an algorithm to detect three objects: pedestrians, cars, and motorcycles. And you will need to explicitly have the full background class, so just the class labels here. If you're using two anchor boxes, then the outputs y will be three by three because you are using three by three grid cell, by two, this is the number of anchors, by eight because that's the dimension of this. Eight is actually five which is plus the number of classes. So five because you have Pc and then the bounding boxes, that's five, and then C1, C2, C3. That dimension is equal to the number of classes. And you can either view this as three by three by two by eight, or by three by three by sixteen. So to construct the training set, you go through each of these nine grid cells and form the appropriate target vector y. So take this first grid cell, there's nothing worth detecting in that grid cell. None of the three classes pedestrian, car and motocycle, appear in the upper left grid cell and so, the target y corresponding to that grid cell would be equal to this. Where Pc for the first anchor box is zero because there's nothing associated for the first anchor box, and is also zero for the second anchor box and so on all of these other values are don't cares. Now, most of the grid cells have nothing in them, but for that box over there, you would have this target vector y. So assuming that your training set has a bounding box like this for the car, it's just a little bit wider than it is tall. And so if your anchor boxes are that, this is a anchor box one, this is anchor box two, then the red box has just slightly higher IoU with anchor box two. And so the car gets associated with this lower portion of the vector. So notice then that Pc associate anchor box one is zero. So you have don't cares all these components. Then you have this Pc is equal to one, then you should use these to specify the position of the red bounding box, and then specify that the correct object is class two. Right that it is a car. So you go through this and for each of your nine grid positions each of your three by three grid positions, you would come up with a vector like this. Come up with a 16 dimensional vector. And so that's why the final output volume is going to be 3 by 3 by 16. Oh and as usual for simplicity on the slide I've used a 3 by 3 the grid. In practice it might be more like a 19 by 19 by 16. Or in fact if you use more anchor boxes, maybe 19 by 19 by 5 x 8 because five times eight is 40. So it will be 19 by 19 by 40. That's if you use five anchor boxes. So that's training and you train ConvNet that inputs an image, maybe 100 by 100 by 3, and your ConvNet would then finally output this output volume in our example, 3 by 3 by 16 or 3 by 3 by 2 by 8. Next, let's look at how your algorithm can make predictions. Given an image, your neural network will output this by 3 by 3 by 2 by 8 volume, where for each of the nine grid cells you get a vector like that. So for the grid cell here on the upper left, if there's no object there, hopefully, your neural network will output zero here, and zero here, and it will output some other values. Your neural network can't output a question mark, can't output a don't care. So I'll put some numbers for the rest. But these numbers will basically be ignored because the neural network is telling you that there's no object there. So it doesn't really matter whether the output is a bounding box or there's is a car. So basically just be some set of numbers, more or less noise. In contrast, for this box over here hopefully, the value of y to the output for that box at the bottom left, hopefully would be something like zero for bounding box one. And then just open a bunch of numbers, just noise. Hopefully, you'll also output a set of numbers that corresponds to specifying a pretty accurate bounding box for the car. So that's how the neural network will make predictions. Finally, you run this through non-max suppression. So just to make it interesting. Let's look at the new test set image. Here's how you would run non-max suppression. If you're using two anchor boxes, then for each of the non-grid cells, you get two predicted bounding boxes. Some of them will have very low probability, very low Pc, but you still get two predicted bounding boxes for each of the nine grid cells. So let's say, those are the bounding boxes you get. And notice that some of the bounding boxes can go outside the height and width of the grid cell that they came from. Next, you then get rid of the low probability predictions. So get rid of the ones that even the neural network says, gee this object probably isn't there. So get rid of those. And then finally if you have three classes you're trying to detect, you're trying to detect pedestrians, cars and motorcycles. What you do is, for each of the three classes, independently run non-max suppression for the objects that were predicted to come from that class. But use non-max suppression for the predictions of the pedestrians class, run non-max suppression for the car class, and non-max suppression for the motorcycle class. But run that basically three times to generate the final predictions. And so the output of this is hopefully that you will have detected all the cars and all the pedestrians in this image. So that's it for the YOLO object detection algorithm. Which is really one of the most effective object detection algorithms, that also encompasses many of the best ideas across the entire computer vision literature that relate to object detection. And you get a chance to practice implementing many components of this yourself, in this week's problem exercise. So I hope you enjoy this week's problem exercise. There's also an optional video that follows this one which you can either watch or not watch as you please. But either way I also look forward to seeing you next week.

(Optional) Region Proposals - 6m

0:00
If you look at the object detection literature, there's a set of ideas called region proposals that's been very influential in computer vision as well. I wanted to make this video optional because I tend to use the region proposal instead of algorithm a bit less often but nonetheless, it has been an influential body of work and an idea that you might come across in your own work. Let's take a look. So if you recall the sliding windows idea, you would take a train crossfire and run it across all of these different windows and run the detector to see if there's a car, pedestrian, or maybe a motorcycle. Now, you could run the algorithm convolutionally, but one downside that the algorithm is it just crossfires a lot of the regions where there's clearly no object. So this rectangle down here is pretty much blank. It's clearly nothing interesting there to classify, and maybe it was also running it on this rectangle, which look likes there's nothing that interesting there. So what Russ Girshik, Jeff Donahue, Trevor Darrell, and Jitendra Malik proposed in the paper, as cited to the bottom of the slide, is an algorithm called R-CNN, which stands for Regions with convolutional networks or regions with CNNs. And what that does is it tries to pick just a few regions that makes sense to run your continent crossfire. So rather than running your sliding windows on every single window, you instead select just a few windows and run your continent crossfire on just a few windows. The way that they perform the region proposals is to run an algorithm called a segmentation algorithm, that results in this output on the right, in order to figure out what could be objects. So, for example, the segmentation algorithm finds a blob over here. And so you might pick that pounding balls and say, "Let's run a crossfire on that blob." It looks like this little green thing finds a blob there, as you might also run the crossfire on that rectangle to see if there's some interesting there. And in this case, this blue blob, if you run a crossfire on that, hope you find the pedestrian, and if you run it on this light cyan blob, maybe you'll find a car, maybe not,. I'm not sure. So the details of this, this is called a segmentation algorithm, and what you do is you find maybe 2000 blobs and place bounding boxes around about 2000 blobs and value crossfire on just those 2000 blobs, and this can be a much smaller number of positions on which to run your continent crossfire, then if you have to run it at every single position throughout the image. And this is a special case if you are running your continent not just on square-shaped regions but running them on tall skinny regions to try to find pedestrians or running them on your white fat regions try to find cars and running them at multiple scales as well. So that's the R-CNN or the region with CNN, a region of CNN features idea. Now, it turns out the R-CNN algorithm is still quite slow. So there's been a line of work to explore how to speed up this algorithm. So the basic R-CNN algorithm with proposed regions using some algorithm and then crossfire the proposed regions one at a time. And for each of the regions, they will output the label. So is there a car? Is there a pedestrian? Is there a motorcycle there? And then also outputs a bounding box, so you can get an accurate bounding box if indeed there is a object in that region. So just to be clear, the R-CNN algorithm doesn't just trust the bounding box it was given. It also outputs a bounding box, B X B Y B H B W, in order to get a more accurate bounding box and whatever happened to surround the blob that the image segmentation algorithm gave it. So it can get pretty accurate bounding boxes. Now, one downside of the R-CNN algorithm was that it is actually quite slow. So over the years, there been a few improvements to the R-CNN algorithm. Russ Girshik proposed the fast R-CNN algorithm, and it's basically the R-CNN algorithm but with a convolutional implementation of sliding windows. So the original implementation would actually classify the regions one at a time. So far, R-CNN use a convolutional implementation of sliding windows, and this is roughly similar to the idea you saw in the fourth video of this week. And that speeds up R-CNN quite a bit. It turns out that one of the problems of fast R-CNN algorithm is that the clustering step to propose the regions is still quite slow and so a different group, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Son, proposed the faster R-CNN algorithm, which uses a convolutional neural network instead of one of the more traditional segmentation algorithms to propose a blob on those regions, and that wound up running quite a bit faster than the fast R-CNN algorithm. Although, I think the faster R-CNN algorithm, most implementations are usually still quit a bit slower than the YOLO algorithm. So the idea of region proposals has been quite influential in computer vision, and I wanted you to know about these ideas because you see others still used these ideas, for myself, and this is my personal opinion, not the opinion of the computer vision research committee as a whole. I think that we can propose an interesting idea but that not having two steps, first, proposed region and then crossfire, being able to do everything more or at the same time, similar to the YOLO or the You Only Look Once algorithm that seems to me like a more promising direction for the long term. But that's my personal opinion and not necessary the opinion of the whole computer vision research committee. So feel free to take that with a grain of salt, but I think that the R-CNN idea, you might come across others using it. So it was worth learning as well so you can understand others algorithms better. So we're now finished up our material for this week on object detection. I hope you enjoy working on this week's problem exercise, and I look forward to seeing you this week.

Special applications: Face recognition & Neural style transfer

Discover how CNNs can be applied to multiple fields, including art generation and face recognition. Implement your own algorithm to generate art and recognize faces!

What is face recognition? - 4m

0:00
Hi, and welcome to this fourth and final week of this course on convolutional neural networks. By now, you've learned a lot about confidence. What I want to do this week is show you a couple important special applications of confidence. We'll start the face recognition, and then go on later this week to neurosal transfer, which you get to implement in the problem exercise as well to create your own artwork. But first, let's start the face recognition and just for fun, I want to show you a demo. When I was leading by those AI group, one of the teams I worked with led by Yuanqing Lin had built a face recognition system that I thought is really cool. Let's take a look. So, I'm going to play this video here, but I can also get whoever is editing this raw video configure out to this better to splice in the raw video or take the one I'm playing here. I want to show you a face recognition demo. I'm in Baidu's headquarters in China. Most companies require that to get inside, you swipe an ID card like this one but here we don't need that. Using face recognition, check what I can do. When I walk up, it recognizes my face, it says, "Welcome Andrew," and I just walk right through without ever having to use my ID card. Let me show you something else. I'm actually here with Lin Yuanqing, the director of IDL which developed all of this face recognition technology. I'm gonna hand him my ID card, which has my face printed on it, and he's going to use it to try to sneak in using my picture instead of a live human. I'm gonna use Andrew's card and try to sneak in and see what happens. So the system is not recognizing it, it refuses to recognize. Okay. Now, I'm going to use my own face. So face recognition technology like this is taking off very rapidly in China and I hope that this type of technology soon makes it way to other countries.. So, pretty cool, right? The video you just saw demoed both face recognition as well as liveness detection. The latter meaning making sure that you are a live human. It turns out liveness detection can be implemented using supervised learning as well to predict live human versus not live human but I want to spend less time on that. Instead, I want to focus our time on talking about how to build the face recognition portion of the system. First, let's start by going over some of the terminology used in face recognition. In the face recognition literature, people often talk about face verification and face recognition. This is the face verification problem which is if you're given an input image as well as a name or ID of a person and the job of the system is to verify whether or not the input image is that of the claimed person. So, sometimes this is also called a one to one problem where you just want to know if the person is the person they claim to be. So, the recognition problem is much harder than the verification problem. To see why, let's say, you have a verification system that's 99 percent accurate. So, 99 percent might not be too bad but now suppose that K is equal to 100 in a recognition system. If you apply this system to a recognition task with a 100 people in your database, you now have a hundred times of chance of making a mistake and if the chance of making mistakes on each person is just one percent. So, if you have a database of a 100 persons and if you want an acceptable recognition error, you might actually need a verification system with maybe 99.9 or even higher accuracy before you can run it on a database of 100 persons that have a high chance and still have a high chance of getting incorrect. In fact, if you have a database of 100 persons currently just be even quite a bit higher than 99 percent for that to work well. But what we do in the next few videos is focus on building a face verification system as a building block and then if the accuracy is high enough, then you probably use that in a recognition system as well. So in the next video, we'll start describing how you can build a face verification system. It turns out one of the reasons that is a difficult problem is you need to solve a one shot learning problem. Let's see in the next video what that means.

One Shot Learning - 4m

0:00
One of the challenges of face recognition is that you need to solve the one-shot learning problem. What that means is that for most face recognition applications you need to be able to recognize a person given just one single image, or given just one example of that person's face. And, historically, deep learning algorithms don't work well if you have only one training example. Let's see an example of what this means and talk about how to address this problem. Let's say you have a database of four pictures of employees in you're organization. These are actually some of my colleagues at Deeplearning.AI, Khan, Danielle, Younes and Thian. Now let's say someone shows up at the office and they want to be let through the turnstile. What the system has to do is, despite ever having seen only one image of Danielle, to recognize that this is actually the same person. And, in contrast, if it sees someone that's not in this database, then it should recognize that this is not any of the four persons in the database. So in the one shot learning problem, you have to learn from just one example to recognize the person again. And you need this for most face recognition systems because you might have only one picture of each of your employees or of your team members in your employee database. So one approach you could try is to input the image of the person, feed it too a ConvNet. And have it output a label, y, using a softmax unit with four outputs or maybe five outputs corresponding to each of these four persons or none of the above. So that would be 5 outputs in the softmax. But this really doesn't work well. Because if you have such a small training set it is really not enough to train a robust neural network for this task. And also what if a new person joins your team? So now you have 5 persons you need to recognize, so there should now be six outputs. Do you have to retrain the ConvNet every time? That just doesn't seem like a good approach. So to carry out face recognition, to carry out one-shot learning. So instead, to make this work, what you're going to do instead is learn a similarity function. In particular, you want a neural network to learn a function which going to denote d, which inputs two images and outputs the degree of difference between the two images. So if the two images are of the same person, you want this to output a small number. And if the two images are of two very different people you want it to output a large number. So during recognition time, if the degree of difference between them is less than some threshold called tau, which is a hyperparameter. Then you would predict that these two pictures are the same person. And if it is greater than tau, you would predict that these are different persons.
3:06
And so this is how you address the face verification problem. To use this for a recognition task, what you do is, given this new picture, you will use this function d to compare these two images. And maybe I'll output a very large number, let's say 10, for this example. And then you compare this with the second image in your database. And because these two are the same person, hopefully you output a very small number. You do this for the other images in your database and so on.
3:43
And based on this, you would figure out that this is actually that person, which is Danielle. And in contrast, if someone not in your database shows up, as you use the function d to make all of these pairwise comparisons, hopefully d will output have a very large number for all four pairwise comparisons. And then you say that this is not any one of the four persons in the database. Notice how this allows you to solve the one-shot learning problem. So long as you can learn this function d, which inputs a pair of images and tells you, basically, if they're the same person or different persons. Then if you have someone new join your team, you can add a fifth person to your database, and it just works fine.
4:30
So you've seen how learning this function d, which inputs two images, allows you to address the one-shot learning problem. In the next video, let's take a look at how you can actually train the neural network to learn dysfunction d.

Siamese Network - 4m

0:00
The job of the function d, which you learned about in the last video, is to input two faces and tell you how similar or how different they are. A good way to do this is to use a Siamese network. Let's take a look.
0:15
You're used to seeing pictures of confidence like these where you input an image, let's say x1. And through a sequence of convolutional and pulling and fully connected layers, end up with a feature vector like that.
0:30
And sometimes this is fed to a softmax unit to make a classification. We're not going to use that in this video. Instead, we're going to focus on this vector of let's say 128 numbers computed by some fully connected layer that is deeper in the network.
0:50
And I'm going to give this list of 128 numbers a name. I'm going to call this f of x1, and you should think of f of x1 as an encoding of the input image x1. So it's taken the input image, here this picture of Kian, and is re-representing it as a vector of 128 numbers. The way you can build a face recognition system is then that if you want to compare two pictures, let's say this first picture with this second picture here. What you can do is feed this second picture to the same neural network with the same parameters and get a different vector of 128 numbers, which encodes this second picture. So I'm going to call this second picture.
1:44
So I'm going to call this encoding of this second picture f of x2, and here I'm using x1 and x2 just to denote two input images. They don't necessarily have to be the first and second examples in your training sets. It can be any two pictures. Finally, if you believe that these encodings are a good representation of these two images, what you can do is then define the image
2:12
d of distance between x1 and x2 as the norm of the difference between the encodings of these two images.
2:26
So this idea of running two identical, convolutional neural networks on two different inputs and then comparing them, sometimes that's called a Siamese neural network architecture. And a lot of the ideas I'm presenting here came from this paper due to Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf in the research system that they developed called DeepFace.
2:54
And many of the ideas I'm presenting here came from a paper due to Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf in a system that they developed called DeepFace.
3:08
So how do you train this Siamese neural network? Remember that these two neural networks have the same parameters.
3:16
So what you want to do is really train the neural network so that the encoding that it computes results in a function d that tells you when two pictures are of the same person. So more formally, the parameters of the neural network define an encoding f of xi. So given any input image xi, the neural network outputs this 128 dimensional encoding f of xi. So more formally, what you want to do is learn parameters so that if two pictures, xi and xj, are of the same person, then you want that distance between their encodings to be small. And in the previous slide, l was using x1 and x2, but it's really any pair xi and xj from your training set. And in contrast, if xi and xj are of different persons, then you want that distance between their encodings to be large. So as you vary the parameters in all of these layers of the neural network, you end up with different encodings. And what you can do is use back propagation and vary all those parameters in order to make sure these conditions are satisfied. So you've learned about the Siamese network architecture and have a sense of what you want the neural network to output for you in terms of what would make a good encoding. But how do you actually define an objective function to make a neural network learn to do what we just discussed here? Let's see how you can do that in the next video using the triplet loss function.

Triplet Loss - 15m

0:00
One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define an applied gradient descent on the triplet loss function. Let's see what that means. To apply the triplet loss, you need to compare pairs of images. For example, given this picture, to learn the parameters of the neural network, you have to look at several pictures at the same time. For example, given this pair of images, you want their encodings to be similar because these are the same person. Whereas, given this pair of images, you want their encodings to be quite different because these are different persons. In the terminology of the triplet loss, what you're going do is always look at one anchor image and then you want to distance between the anchor and the positive image, really a positive example, meaning as the same person to be similar. Whereas, you want the anchor when pairs are compared to the negative example for their distances to be much further apart. So, this is what gives rise to the term triplet loss, which is that you'll always be looking at three images at a time. You'll be looking at an anchor image, a positive image, as well as a negative image. And I'm going to abbreviate anchor positive and negative as A, P, and N. So to formalize this, what you want is for the parameters of your neural network of your encodings to have the following property, which is that you want the encoding between the anchor minus the encoding of the positive example, you want this to be small and in particular, you want this to be less than or equal to the distance of the squared norm between the encoding of the anchor and the encoding of the negative, where of course, this is d of A, P and this is d of A, N. And you can think of d as a distance function, which is why we named it with the alphabet d. Now, if we move to term from the right side of this equation to the left side, what you end up with is f of A minus f of P squared minus, let's take the right-hand side now, minus F of N squared, you want this to be less than or equal to zero. But now, we're going to make a slight change to this expression, which is one trivial way to make sure this is satisfied, is to just learn everything equals zero. If f always equals zero, then this is zero minus zero, which is zero, this is zero minus zero which is zero. And so, well, by saying f of any image equals a vector of all zeroes, you can almost trivially satisfy this equation. So, to make sure that the neural network doesn't just output zero for all the encoding, so to make sure that it doesn't set all the encodings equal to each other. Another way for the neural network to give a trivial output is if the encoding for every image was identical to the encoding to every other image, in which case, you again get zero minus zero. So to prevent a neural network from doing that, what we're going to do is modify this objective to say that, this doesn't need to be just less than or equal to zero, it needs to be quite a bit smaller than zero. So, in particular, if we say this needs to be less than negative alpha, where alpha is another hyperparameter, then this prevents a neural network from outputting the trivial solutions. And by convention, usually, we write plus alpha instead of negative alpha there. And this is also called, a margin, which is terminology that you'd be familiar with if you've also seen the literature on support vector machines, but don't worry about it if you haven't. And we can also modify this equation on top by adding this margin parameter. So to give an example, let's say the margin is set to 0.2. If in this example, d of the anchor and the positive is equal to 0.5, then you won't be satisfied if d between the anchor and the negative was just a little bit bigger, say 0.51. Even though 0.51 is bigger than 0.5, you're saying, that's not good enough, we want a dfA, N to be much bigger than dfA, P and in particular, you want this to be at least 0.7 or higher. Alternatively, to achieve this margin or this gap of at least 0.2, you could either push this up or push this down so that there is at least this gap of this alpha, hyperparameter alpha 0.2 between the distance between the anchor and the positive versus the anchor and the negative. So that's what having a margin parameter here does, which is it pushes the anchor positive pair and the anchor negative pair further away from each other. So, let's take this equation we have here at the bottom, and on the next slide, formalize it, and define the triplet loss function. So, the triplet loss function is defined on triples of images. So, given three images, A, P, and N, the anchor positive and negative examples. So the positive examples is of the same person as the anchor, but the negative is of a different person than the anchor. We're going to define the loss as follows. The loss on this example, which is really defined on a triplet of images is, let me first copy over what we had on the previous slide. So, that was fA minus fP squared minus fA minus fN squared, and then plus alpha, the margin parameter. And what you want is for this to be less than or equal to zero. So, to define the loss function, let's take the max between this and zero. So, the effect of taking the max here is that, so long as this is less than zero, then the loss is zero, because the max is something less than equal to zero, when zero is going to be zero. So, so long as you achieve the goal of making this thing I've underlined in green, so long as you've achieved the objective of making that less than or equal to zero, then the loss on this example is equals to zero. But if on the other hand, if this is greater than zero, then if you take the max, the max we end up selecting, this thing I've underlined in green, and so you would have a positive loss. So by trying to minimize this, this has the effect of trying to send this thing to be zero, less than or equal to zero. And then, so long as there's zero or less than or equal to zero, the neural network doesn't care how much further negative it is. So, this is how you define the loss on a single triplet and the overall cost function for your neural network can be sum over a training set of these individual losses on different triplets. So, if you have a training set of say 10,000 pictures with 1,000 different persons, what you'd have to do is take your 10,000 pictures and use it to generate, to select triplets like this and then train your learning algorithm using gradient descent on this type of cost function, which is really defined on triplets of images drawn from your training set. Notice that in order to define this dataset of triplets, you do need some pairs of A and P. Pairs of pictures of the same person. So the purpose of training your system, you do need a dataset where you have multiple pictures of the same person. That's why in this example, I said if you have 10,000 pictures of 1,000 different person, so maybe have 10 pictures on average of each of your 1,000 persons to make up your entire dataset. If you had just one picture of each person, then you can't actually train this system. But of course after training, if you're applying this, but of course after having trained the system, you can then apply it to your one shot learning problem where for your face recognition system, maybe you have only a single picture of someone you might be trying to recognize. But for your training set, you do need to make sure you have multiple images of the same person at least for some people in your training set so that you can have pairs of anchor and positive images. Now, how do you actually choose these triplets to form your training set? One of the problems if you choose A, P, and N randomly from your training set subject to A and P being from the same person, and A and N being different persons, one of the problems is that if you choose them so that they're at random, then this constraint is very easy to satisfy. Because given two randomly chosen pictures of people, chances are A and N are much different than A and P. I hope you still recognize this notation, this d(A, P) was what we had written on the last year's slides as this encoding. So this is just equal to this squared known distance between the encodings that we have on the previous slide. But if A and N are two randomly chosen different persons, then there is a very high chance that this will be much bigger more than the margin alpha that that term on the left. And so, the neural network won't learn much from it. So to construct a training set, what you want to do is to choose triplets A, P, and N that are hard to train on. So in particular, what you want is for all triplets that this constraint be satisfied. So, a triplet that is hard will be if you choose values for A, P, and N so that maybe d(A, P) is actually quite close to d(A,N). So in that case, the learning algorithm has to try extra hard to take this thing on the right and try to push it up or take this thing on the left and try to push it down so that there is at least a margin of alpha between the left side and the right side. And the effect of choosing these triplets is that it increases the computational efficiency of your learning algorithm. If you choose your triplets randomly, then too many triplets would be really easy, and so, gradient descent won't do anything because your neural network will just get them right, pretty much all the time. And it's only by using hard triplets that the gradient descent procedure has to do some work to try to push these quantities further away from those quantities. And if you're interested, the details are presented in this paper by Florian Schroff, Dmitry Kalinichenko, and James Philbin, where they have a system called FaceNet, which is where a lot of the ideas I'm presenting in this video come from. By the way, this is also a fun fact about how algorithms are often named in the deep learning world, which is if you work in a certain domain, then we call that blank. You often have a system called blank net or deep blank. So, we've been talking about face recognition. So this paper is called FaceNet, and in the last video, you just saw deep face. But this idea of a blank net or deep blank is a very popular way of naming algorithms in the deep learning world. And you should feel free to take a look at that paper if you want to learn some of these other details for speeding up your algorithm by choosing the most useful triplets to train on, it is a nice paper. So, just to wrap up, to train on triplet loss, you need to take your training set and map it to a lot of triples. So, here is our triple with an anchor and a positive, both for the same person and the negative of a different person. Here's another one where the anchor and positive are of the same person but the anchor and negative are of different persons and so on. And what you do having defined this training sets of anchor positive and negative triples is use gradient descent to try to minimize the cost function J we defined on an earlier slide, and that will have the effect of that propagating to all of the parameters of the neural network in order to learn an encoding so that d of two images will be small when these two images are of the same person, and they'll be large when these are two images of different persons. So, that's it for the triplet loss and how you can train a neural network for learning and an encoding for face recognition. Now, it turns out that commercial face recognition systems are trained on fairly large datasets at this point. Often, north of the million images sometimes not in frequently north of 10 million images. And there are some commercial companies talking about using over 100 million images. So these are very large datasets by models that even balm. So that's it for the triplet loss and how you can use it to train a neural network to operate a good encoding for face recognition. Now, it turns out that today's face recognition systems especially the loss cure commercial face recognition systems are trained on very large datasets. Datasets north of a million images is not uncommon, some companies are using north of 10 million images and some companies have north of 100 million images with which to try to train these systems. So these are very large datasets even by modern standards, these dataset assets are not easy to acquire. Fortunately, some of these companies have trained these large networks and posted parameters online. So, rather than trying to train one of these networks from scratch, this is one domain where because of the share data volume sizes, this is one domain where often it might be useful for you to download someone else's pre-train model, rather than do everything from scratch yourself. But even if you do download someone else's pre-train model, I think it's still useful to know how these algorithms were trained or in case you need to apply these ideas from scratch yourself for some application. So that's it for the triplet loss. In the next video, I want to show you also some other variations on siamese networks and how to train these systems. Let's go onto the next video.

Face Verification and Binary Classification - 6m

0:00
The Triplet Loss is one good way to learn the parameters of a continent for face recognition. There's another way to learn these parameters. Let me show you how face recognition can also be posed as a straight binary classification problem. Another way to train a neural network, is to take this pair of neural networks to take this Siamese Network and have them both compute these embeddings, maybe 128 dimensional embeddings, maybe even higher dimensional, and then have these be input to a logistic regression unit to then just make a prediction. Where the target output will be one if both of these are the same persons, and zero if both of these are of different persons. So, this is a way to treat face recognition just as a binary classification problem. And this is an alternative to the triplet loss for training a system like this. Now, what does this final logistic regression unit actually do? The output y hat will be a sigmoid function, applied to some set of features but rather than just feeding in, these encodings, what you can do is take the differences between the encodings. So, let me show you what I mean. Let's say, I write a sum over K equals 1 to 128 of the absolute value, taken element wise between the two different encodings. Let me just finish writing this out and then we'll see what this means. In this notation, f of x i is the encoding of the image x i and the substitute k means to just select out the cave components of this vector. This is taking the element Y's difference in absolute values between these two encodings. And what you might do is think of these 128 numbers as features that you then feed into logistic regression. And, you'll find that little regression can have additional parameters w, i, and b similar to a normal logistic regression unit. And you would train appropriate waiting on these 128 features in order to predict whether or not these two images are of the same person or of different persons. So, this will be one pretty useful way to learn to predict zero or one whether these are the same person or different persons. And there are a few other variations on how you can compute this formula that I had underlined in green. For example, another formula could be this k minus f of x j, k squared divided by f of x i on plus f of x j k. This is sometimes called the chi square form. This is the Greek alphabet chi. But this is sometimes called a chi square similarity. And this and other variations are explored in this deep face paper, which I referenced earlier as well. So in this learning formulation, the input is a pair of images, so this is really your training input x and the output y is either zero or one depending on whether you're inputting a pair of similar or dissimilar images. And same as before, you're training is Siamese Network so that means that, this neural network up here has parameters that are what they're really tied to the parameters in this lower neural network. And this system can work pretty well as well. Lastly, just to mention, one computational trick that can help neural deployment significantly, which is that, if this is the new image, so this is an employee walking in hoping that the turnstile the doorway will open for them and that this is from your database image. Then instead of having to compute, this embedding every single time, where you can do is actually pre-compute that, so, when the new employee walks in, what you can do is use this upper components to compute that encoding and use it, then compare it to your pre-computed encoding and then use that to make a prediction y hat. Because you don't need to store the raw images and also because if you have a very large database of employees, you don't need to compute these encodings every single time for every employee database. This idea of free computing, some of these encodings can save a significant computation. And this type of pre-computation works both for this type of Siamese Central architecture where you treat face recognition as a binary classification problem, as well as, when you were learning encodings maybe using the Triplet Loss function as described in the last couple of videos. And so just to wrap up, to treat face verification supervised learning, you create a training set of just pairs of images now is of triplets of pairs of images where the target label is one. When these are a pair of pictures of the same person and where the tag label is zero, when these are pictures of different persons and you use different pairs to train the neural network to train the scientists that were using back propagation. So, this version that you just saw of treating face verification and by extension face recognition as a binary classification problem, this works quite well as well. And so with that, I hope that you now know, what it would take to train your own face verification or your own face recognition system, one that can do one shot learning.

What is neural style transfer? - 2m

0:00
One of the most fun and exciting applications of ConvNet recently has been Neural Style Transfer. You get to implement this yourself and generate your own artwork in the problem exercise. But what is Neural Style Transfer? Let me show you a few examples. Let's say you take this image, this is actually taken from the Stanford University not far from my Stanford office and you want this picture recreated in the style of this image on the right. This is actually Van Gogh's, Starry Night painting. What Neural Style Transfer allows you to do is generated new image like the one below which is a picture of the Stanford University Campus that painted but drawn in the style of the image on the right. In order to describe how you can implement this yourself, I'm going to use C to denote the content image, S to denote the style image, and G to denote the image you will generate. Here's another example, let's say you have this content image so let's see this is of the Golden Gate Bridge in San Francisco and you have this style image, this is actually Pablo Picasso image. You can then combine these to generate this image G which is the Golden Gate painted in the style of that Picasso shown on the right. The examples shown on this slide were generated by Justin Johnson. What you'll learn in the next few videos is how you can generate these images yourself. In order to implement Neural Style Transfer, you need to look at the features extracted by ConvNet at various layers, the shallow and the deeper layers of a ConvNet. Before diving into how you can implement a Neural Style Transfer, what I want to do in the next video is try to give you better intuition about whether all these layers of a ConvNet really computing. Let's take a look at that in the next video.

What are deep ConvNets learning? - 7m

0:00
What are deep ConvNets really learning? In this video, I want to share with you some visualizations that will help you hone your intuition about what the deeper layers of a ConvNet really are doing. And this will help us think through how you can implement neural style transfer as well. Let's start with an example. Lets say you've trained a ConvNet, this is an alex net like network, and you want to visualize what the hidden units in different layers are computing. Here's what you can do. Let's start with a hidden unit in layer 1. And suppose you scan through your training sets and find out what are the images or what are the image patches that maximize that unit's activation. So in other words pause your training set through your neural network, and figure out what is the image that maximizes that particular unit's activation. Now, notice that a hidden unit in layer 1, will see only a relatively small portion of the neural network. And so if you visualize, if you plot what activated unit's activation, it makes makes sense to plot just a small image patches, because all of the image that that particular unit sees. So if you pick one hidden unit and find the nine input images that maximizes that unit's activation, you might find nine image patches like this. So looks like that in the lower region of an image that this particular hidden unit sees, it's looking for an egde or a line that looks like that. So those are the nine image patches that maximally activate one hidden unit's activation. Now, you can then pick a different hidden unit in layer 1 and do the same thing. So that's a different hidden unit, and looks like this second one, represented by these 9 image patches here. Looks like this hidden unit is looking for a line sort of in that portion of its input region, we'll also call this receptive field. And if you do this for other hidden units, you'll find other hidden units, tend to activate in image patches that look like that. This one seems to have a preference for a vertical light edge, but with a preference that the left side of it be green. This one really prefers orange colors, and this is an interesting image patch. This red and green together will make a brownish or a brownish-orangish color, but the neuron is still happy to activate with that, and so on. So this is nine different representative neurons and for each of them the nine image patches that they maximally activate on. So this gives you a sense that, units, train hidden units in layer 1, they're often looking for relatively simple features such as edge or a particular shade of color. And all of the examples I'm using in this video come from this paper by Mathew Zeiler and Rob Fergus, titled visualizing and understanding convolutional networks. And I'm just going to use one of the simpler ways to visualize what a hidden unit in a neural network is computing. If you read their paper, they have some other more sophisticated ways of visualizing when the ConvNet is running as well.
3:22
But now you have repeated this procedure several times for nine hidden units in layer 1. What if you do this for some of the hidden units in the deeper layers of the neuron network. And what does the neural network then learning at a deeper layers. So in the deeper layers, a hidden unit will see a larger region of the image. Where at the extreme end each pixel could hypothetically affect the output of these later layers of the neural network. So later units are actually seen larger image patches, I'm still going to plot the image patches as the same size on these slides. But if we repeat this procedure, this is what you had previously for layer 1, and this is a visualization of what maximally activates nine different hidden units in layer 2. So I want to be clear about what this visualization is. These are the nine patches that cause one hidden unit to be highly activated. And then each grouping, this is a different set of nine image patches that cause one hidden unit to be activated. So this visualization shows nine hidden units in layer 2, and for each of them shows nine image patches that causes that hidden unit to have a very large output, a very large activation. And you can repeat these for deeper layers as well. Now, on this slide, I know it's kind of hard to see these tiny little image patches, so let me zoom in for some of them. For layer 1, this is what you saw. So for example, this is that first unit we saw which was highly activated, if in the region of the input image, you can see there's an edge maybe at that angle. Now let's zoom in for layer 2 as well, to that visualization. So this is interesting, layer 2 looks it's detecting more complex shapes and patterns. So for example, this hidden unit looks like it's looking for a vertical texture with lots of vertical lines. This hidden unit looks like its highly activated when there's a rounder shape to the left part of the image. Here's one that is looking for very thin vertical lines and so on. And so the features the second layer is detecting are getting more complicated. How about layer 3? Let's zoom into that, in fact let me zoom in even bigger, so you can see this better, these are the things that maximally activate layer 3. But let's zoom in even bigger, and so this is pretty interesting again. It looks like there is a hidden unit that seems to respond highly to a rounder shape in the lower left hand portion of the image, maybe. So that ends up detecting a lot of cars, dogs and wonders is even starting to detect people. And this one look like it is detecting certain textures like honeycomb shapes, or square shapes, this irregular texture. And some of these it's difficult to look at and manually figure out what is it detecting, but it is clearly starting to detect more complex patterns. How about the next layer? Well, here is layer 4, and you'll see that the features or the patterns is detecting or even more complex. It looks like this has learned almost a dog detector, but all these dogs likewise similar, right? Is this, I don't know what dog species or dog breed this is. But now all those are dogs, but they look relatively similar as dogs go. Looks like this hidden unit and therefore it is detecting water.
6:53
This looks like it is actually detecting the legs of a bird and so on. And then layer 5 is detecting even more sophisticated things. So you'll notice there's also a neuron that seems to be a dog detector, but set of dogs detecting here seems to be more varied. And then this seems to be detecting keyboards and things with a keyboard like texture, although maybe lots of dots against background. I think this neuron here may be detecting text, it's always hard to be sure. And then this one here is detecting flowers. So we've gone a long way from detecting relatively simple things such as edges in layer 1 to textures in layer 2, up to detecting very complex objects in the deeper layers. So I hope this gives you some better intuition about what the shallow and deeper layers of a neural network are computing. Next, let's use this intuition to start building a neural-style transfer algorithm.

Cost Function - 3m

0:00
To build a Neural Style Transfer system, let's define a cost function for the generated image. What you see later is that by minimizing this cost function, you can generate the image that you want. Remember what the problem formulation is. You're given a content image C, given a style image S and you goal is to generate a new image G. In order to implement neural style transfer, what you're going to do is define a cost function J of G that measures how good is a particular generated image and we'll use gradient to descent to minimize J of G in order to generate this image. How good is a particular image? Well, we're going to define two parts to this cost function. The first part is called the content cost. This is a function of the content image and of the generated image and what it does is it measures how similar is the contents of the generated image to the content of the content image C. And then going to add that to a style cost function which is now a function of S,G and what this does is it measures how similar is the style of the image G to the style of the image S. Finally, we'll weight these with two hyper parameters alpha and beta to specify the relative weighting between the content costs and the style cost. It seems redundant to use two different hyper parameters to specify the relative cost of the weighting. One hyper parameter seems like it would be enough but the original authors of the Neural Style Transfer Algorithm, use two different hyper parameters. I'm just going to follow their convention here. The Neural Style Transfer Algorithm I'm going to present in the next few videos is due to Leon Gatys, Alexander Ecker and Matthias. Their papers is not too hard to read so after watching these few videos if you wish, I certainly encourage you to take a look at their paper as well if you want. The way the algorithm would run is as follows, having to find the cost function J of G in order to actually generate a new image what you do is the following. You would initialize the generated image G randomly so it might be 100 by 100 by 3 or 500 by 500 by 3 or whatever dimension you want it to be. Then we'll define the cost function J of G on the previous slide. What you can do is use gradient descent to minimize this so you can update G as G minus the derivative respect to the cost function of J of G. In this process, you're actually updating the pixel values of this image G which is a 100 by 100 by 3 maybe rgb channel image. Here's an example, let's say you start with this content image and this style image. This is a another probably Picasso image. Then when you initialize G randomly, you're initial randomly generated image is just this white noise image with each pixel value chosen at random. As you run gradient descent, you minimize the cost function J of G slowly through the pixel value so then you get slowly an image that looks more and more like your content image rendered in the style of your style image. In this video, you saw the overall outline of the Neural Style Transfer Algorithm where you define a cost function for the generated image G and minimize it. Next, we need to see how to define the content cost function as well as the style cost function. Let's take a look at that starting in the next video.

Content Cost Function - 3m

0:00
The cost function of the neural style transfer algorithm had a content cost component and a style cost component. Let's start by defining the content cost component. Remember that this is the overall cost function of the neural style transfer algorithm. So, let's figure out what should the content cost function be. Let's say that you use hidden layer l to compute the content cost. If l is a very small number, if you use hidden layer one, then it will really force your generated image to pixel values very similar to your content image. Whereas, if you use a very deep layer, then it's just asking, "Well, if there is a dog in your content image, then make sure there is a dog somewhere in your generated image. " So in practice, layer l chosen somewhere in between. It's neither too shallow nor too deep in the neural network. And because you plan this yourself, in the problem exercise that you did at the end of this week, I'll leave you to gain some intuitions with the concrete examples in the problem exercise as well. But usually, I was chosen to be somewhere in the middle of the layers of the neural network, neither too shallow nor too deep. What you can do is then use a pre-trained ConvNet, maybe a VGG network, or could be some other neural network as well. And now, you want to measure, given a content image and given a generated image, how similar are they in content. So let's let this a_superscript_l and this be the activations of layer l on these two images, on the images C and G. So, if these two activations are similar, then that would seem to imply that both images have similar content. So, what we'll do is define J_content(C,G) as just how soon or how different are these two activations. So, we'll take the element-wise difference between these hidden unit activations in layer l, between when you pass in the content image compared to when you pass in the generated image, and take that squared. And you could have a normalization constant in front or not, so it's just one of the two or something else. It doesn't really matter since this can be adjusted as well by this hyperparameter alpha. So, just be clear on using this notation as if both of these have been unrolled into vectors, so then, this becomes the square root of the l_2 norm between this and this, after you've unrolled them both into vectors. There's really just the element-wise sum of squared differences between these two activation. But it's really just the element-wise sum of squares of differences between the activations in layer l, between the images in C and G. And so, when later you perform gradient descent on J_of_G to try to find a value of G, so that the overall cost is low, this will incentivize the algorithm to find an image G, so that these hidden layer activations are similar to what you got for the content image. So, that's how you define the content cost function for the neural style transfer. Next, let's move on to the style cost function.

Style Cost Function - 13m

0:00
In the last video, you saw how to define the content cost function for the neural style transfer. Next, let's take a look at the style cost function. So, what is the style of an image mean? Let's say you have an input image like this, they used to seeing a convnet like that, compute features that there's different layers. And let's say you've chosen some layer L, maybe that layer to define the measure of the style of an image. What we need to do is define the style as the correlation between activations across different channels in this layer L activation. So here's what I mean by that. Let's say you take that layer L activation. So this is going to be nh by nw by nc block of activations, and we're going to ask how correlated are the activations across different channels. So to explain what I mean by this may be slightly cryptic phrase, let's take this block of activations and let me shade the different channels by a different colors. So in this below example, we have say five channels and which is why I have five shades of color here. In practice, of course, in neural network we usually have a lot more channels than five, but using just five makes it drawing easier. But to capture the style of an image, what you're going to do is the following. Let's look at the first two channels. Let's see for the red channel and the yellow channel and say how correlated are activations in these first two channels. So, for example, in the lower right hand corner, you have some activation in the first channel and some activation in the second channel. So that gives you a pair of numbers. And what you do is look at different positions across this block of activations and just look at those two pairs of numbers, one in the first channel, the red channel, one in the yellow channel, the second channel. And you just look at these two pairs of numbers and see when you look across all of these positions, all of these nh by nw positions, how correlated are these two numbers. So, why does this capture style? Let's look another example. Here's one of the visualizations from the earlier video. This comes from again the paper by Matthew Zeiler and Rob Fergus that I have reference earlier. And let's say for the sake of arguments, that the red neuron corresponds to, and let's say for the sake of arguments, that the red channel corresponds to this neurons so we're trying to figure out if there's this little vertical texture in a particular position in the nh and let's say that this second channel, this yellow second channel corresponds to this neuron, which is vaguely looking for orange colored patches. What does it mean for these two channels to be highly correlated? Well, if they're highly correlated what that means is whatever part of the image has this type of subtle vertical texture, that part of the image will probably have these orange-ish tint. And what does it mean for them to be uncorrelated? Well, it means that whenever there is this vertical texture, it's probably won't have that orange-ish tint. And so the correlation tells you which of these high level texture components tend to occur or not occur together in part of an image and that's the degree of correlation that gives you one way of measuring how often these different high level features, such as vertical texture or this orange tint or other things as well, how often they occur and how often they occur together and don't occur together in different parts of an image. And so, if we use the degree of correlation between channels as a measure of the style, then what you can do is measure the degree to which in your generated image, this first channel is correlated or uncorrelated with the second channel and that will tell you in the generated image how often this type of vertical texture occurs or doesn't occur with this orange-ish tint and this gives you a measure of how similar is the style of the generated image to the style of the input style image. So let's now formalize this intuition. So what you can to do is given an image computes something called a style matrix, which will measure all those correlations we talks about on the last slide. So, more formally, let's let a superscript l, subscript i, j,k denote the activation at position i,j,k in hidden layer l. So i indexes into the height, j indexes into the width, and k indexes across the different channels. So, in the previous slide, we had five channels that k will index across those five channels. So what the style matrix will do is you're going to compute a matrix clauses G superscript square bracketed l. This is going to be an nc by nc dimensional matrix, so it'd be a square matrix. Remember you have nc channels and so you have an nc by nc dimensional matrix in order to measure how correlated each pair of them is. So particular G, l, k, k prime will measure how correlated are the activations in channel k compared to the activations in channel k prime. Well here, k and k prime will range from 1 through nc, the number of channels they're all up in that layer. So more formally, the way you compute G, l and I'm just going to write down the formula for computing one elements. So the k, k prime elements of this. This is going to be sum of a i, sum of a j, of deactivation and that layer i, j, k times the activation at i, j, k prime. So, here, remember i and j index across to a different positions in the block, indexes over the height and width. So i is the sum from one to nh and j is a sum from one to nw and k here and k prime index over the channel so k and k prime range from one to the total number of channels in that layer of the neural network. So all this is doing is summing over the different positions that the image over the height and width and just multiplying the activations together of the channels k and k prime and that's the definition of G,k,k prime. And you do this for every value of k and k prime to compute this matrix G, also called the style matrix. And so notice that if both of these activations tend to be lashed together, then G, k, k prime will be large, whereas if they are uncorrelated then g,k, k prime might be small. And technically, I've been using the term correlation to convey intuition but this is actually the unnormalized cross of the areas because we're not subtracting out the mean and this is just multiplied by these elements directly. So this is how you compute the style of an image. And you'd actually do this for both the style image s,n for the generated image G. So just to distinguish that this is the style image, maybe let me add a round bracket S there, just to denote that this is the style image for the image S and those are the activations on the image S. And what you do is then compute the same thing for the generated image. So it's really the same thing summarized sum of a j, a, i, j, k, l, a, i, j,k,l and the summation indices are the same. Let's follow this and you want to just denote this is for the generated image, I'll just put the round brackets G there. So, now, you have two matrices they capture what is the style with the image s and what is the style of the image G. And, by the way, we've been using the alphabet capital G to denote these matrices. In linear algebra, these are also called the grand matrix of these in called grand matrices but in this video, I'm just going to use the term style matrix because this term grand matrix that most of these using capital G to denote these matrices. Finally, the cost function, the style cost function. If you're doing this on layer l between s and G, you can now define that to be just the difference between these two matrices, G l, G square and these are matrices. So just take it from the previous one. This is just the sum of squares of the element wise differences between these two matrices and just divides this out this is going to be sum over k, sum over k prime of these differences of s, k, k prime minus G l, G, k, k prime and then the sum of square of the elements. The authors actually used this for the normalization constants two times of nh, nw, in that layer, nc in that layer and I'll square this and you can put this up here as well. But a normalization constant doesn't matter that much because this causes multiplied by some hyperparameter b anyway. So just to finish up, this is the style cost function defined using layer l and as you saw on the previous slide, this is basically the Frobenius norm between the two star matrices computed on the image s and on the image G Frobenius on squared and never by the just low normalization constants, which isn't that important. And, finally, it turns out that you get more visually pleasing results if you use the style cost function from multiple different layers. So, the overall style cost function, you can define as sum over all the different layers of the style cost function for that layer. We should define the book weighted by some set of parameters, by some set of additional hyperparameters, which we'll denote as lambda l here. So what it does is allows you to use different layers in a neural network. Well of the early ones, which measure relatively simpler low level features like edges as well as some later layers, which measure high level features and cause a neural network to take both low level and high level correlations into account when computing style. And, in the following exercise, you gain more intuition about what might be reasonable choices for this type of parameter lambda as well. And so just to wrap this up, you can now define the overall cost function as alpha times the content cost between c and G plus beta times the style cost between s and G and then just create in the sense or a more sophisticated optimization algorithm if you want in order to try to find an image G that normalize, that tries to minimize this cost function j of G. And if you do that, you can generate pretty good looking neural artistic and if you do that you'll be able to generate some pretty nice novel artwork. So that's it for neural style transfer and I hope you have fun implementing it in this week's printing exercise. Before wrapping up this week, there's just one last thing I want to share of you, which is how to do convolutions over 1D or 3D data rather than over only 2D images. Let's go into the last video.

1D and 3D Generalizations - 9m

0:00
You have learned a lot about ConvNets, everything ranging from the architecture of the ConvNet to how to use it for image recognition, to object detection, to face recognition and neural-style transfer. And even though most of the discussion has focused on images, on sort of 2D data, because images are so pervasive. It turns out that many of the ideas you've learned about also apply, not just to 2D images but also to 1D data as well as to 3D data. Let's take a look. In the first week of this course, you learned about the 2D convolution, where you might input a 14 x 14 image and convolve that with a 5 x 5 filter. And you saw how 14 x 14 convolved with 5 x 5, this gives you a 10 x 10 output. And if you have multiple channels, maybe those 14 x 14 x 3, then it would be 5 x 5 that matches the same 3. And then if you have multiple filters, say 16 filters, you end up with 10 x 10 x 16. It turns out that a similar idea can be applied to 1D data as well. For example, on the left is an EKG signal, also called an electrocardioagram. Basically if you place an electrode over your chest, this measures the little voltages that vary across your chest as your heart beats. Because the little electric waves generated by your heart's beating can be measured with a pair of electrodes. And so this is an EKG of someone's heart beating. And so each of these peaks corresponds to one heartbeat. So if you want to use EKG signals to make medical diagnoses, for example, then you would have 1D data because what EKG data is, is it's a time series showing the voltage at each instant in time. So rather than a 14 x 14 dimensional input, maybe you just have a 14 dimensional input. And in that case, you might want to convolve this with a 1 dimensional filter. So rather than the 5 by 5, you just have 5 dimensional filter. So with 2D data what a convolution will allow you to do was to take the same 5 x 5 feature detector and apply it across at different positions throughout the image. And that's how you wound up with your 10 x 10 output. What a 1D filter allows you to do is take your 5 dimensional filter and similarly apply that in lots of different positions throughout this 1D signal. And so if you apply this convolution, what you find is that a 14 dimensional thing convolved with this 5 dimensional thing, this would give you a 10 dimensional output. And again, if you have multiple channels, you might have in this case you can use just 1 channel, if you have 1 lead or 1 electrode for EKG, so times 5 x 1. And if you have 16 filters, maybe end up with 10 x 16 over there, and this could be one layer of your ConvNet. And then for the next layer of your ConvNet, if you input a 10 x 16 dimensional input and you might convolve that with a 5 dimensional filter again. Then these have 16 channels, so that has a match. And we have 32 filters, then the output of another layer would be 6 x 32, if you have 32 filters, right? And the analogy to the the 2D data, this is similar to all of the 10 x 10 x 16 data and convolve it with a 5 x 5 x 16, and that has to match. That will give you a 6 by 6 dimensional output, and you have 32 filters, that's where the 32 comes from. So all of these ideas apply also to 1D data, where you can have the same feature detector, such as this, apply to a variety of positions. For example, to detect the different heartbeats in an EKG signal. But to use the same set of features to detect the heartbeats even at different positions along these time series, and so ConvNet can be used even on 1D data. For along with 1D data applications, you actually use a recurrent neural network, which you learn about in the next course. But some people can also try using ConvNets in these problems. And in the next course on sequence models, which we will talk about recurring neural networks and LCM and other models like that. We'll talk about the pros and cons of using 1D ConvNets versus some of those other models that are explicitly designed to sequenced data. So that's the generalization from 2D to 1D. How about 3D data? Well, what is three dimensional data? It is that, instead of having a 1D list of numbers or a 2D matrix of numbers, you now have a 3D block, a three dimensional input volume of numbers. So here's the example of that which is if you take a CT scan, this is a type of X-ray scan that gives a three dimensional model of your body. But what a CT scan does is it takes different slices through your body. So as you scan through a CT scan which I'm doing here, you can look at different slices of the human torso to see how they look and so this data is fundamentally three dimensional. And one way to think of this data is if your data now has some height, some width, and then also some depth. Where this is the different slices through this volume, are the different slices through the torso. So if you want to apply a ConvNet to detect features in this three dimensional CAT scan or CT scan, then you can generalize the ideas from the first slide to three dimensional convolutions as well. So if you have a 3D volume, and for the sake of simplicity let's say is 14 x 14 x 14 and so this is the height, width, and depth of the input CT scan. And again, just like images they'll all have to be square, a 3D volume doesn't have to be a perfect cube as well. So the height and width of a image can be different, and in the same way the height and width and the depth of a CT scan can be different. But I'm just using 14 x 14 x 14 here to simplify the discussion. And if you convolve this with a now a 5 x 5 x 5 filter, so you're filters now are also three dimensional then this would give you a 10 x 10 x 10 volume. And technically, you could also have by 1, if this is the number of channels. So this is just a 3D volume, but your data can also have different numbers of channels, then this would be times 1 as well. Because the number of channels here and the number of channels here has to match. And then if you have 16 filters did a 5 x 5 x 5 x 1 then the next output will be a 10 x 10 x 10 x 16. So this could be one layer of your ConvNet over 3D data, and if the next layer of the ConvNet convolves this again with a 5 x 5 x 5 x 16 dimensional filter. So this number of channels has to match data as usual, and if you have 32 filters then similar to what you saw was ConvNet of the images. Now you'll end up with a 6 x 6 x 6 volume across 32 channels. So 3D data can also be learned on, sort of directly using a three dimensional ConvNet. And what these filters do is really detect features across your 3D data,
8:08
CAT scans, medical scans as one example of 3D volumes. But another example of data, you could treat as a 3D volume would be movie data, where the different slices could be different slices in time through a movie. And you could use this to detect motion or people taking actions in movies. So that's it on generalization of ConvNets from 2D data to also 1D as well as 3D data. Image data is so pervasive that the vast majority of ConvNets are on 2D data, on image data, but I hope that these other models will be helpful to you as well. So this is it, this is the last video of this week and the last video of this course on ConvNets. You've learned a lot about ConvNets and I hope you find many of these ideas useful for your future work. So congratulations on finishing these videos. I hope you enjoyed this week's exercise and I look forward also to seeing you in the next course on sequence models.

posted @ 2019-04-14 00:19  KerShaw  阅读(321)  评论(0编辑  收藏  举报