Training (deep) Neural Networks Part: 1

Nowadays training deep learning models have become extremely easy with high-quality libraries such as Torch and Theano. These libraries are really helpful for rapidly prototyping deep learning models even without understanding much about deep learning algorithms. However, the underpinning of algorithms will help us to get maximum benefits from above deep learning libraries. Therefore, this (and upcoming) tutorials will discuss few deep learning algorithms and implement them using Python/Numpy

Introduction to Neural Network

Neural networks are made out of logistic units and depending on the way you arrange those units, you might get different neural network architectures. Figure: 1 depicts the schematic diagram of a logistic unit. You can consider it as a function which takes a input vector (

$Architecture of a logistic unit. Vector $X=[1, x_{1}, x_{2}, ... , x_{6}]^{T}$ represents input data to the logistic unit, vector $W=[w_{0}, w_{1}, w_{2}, ... , w_{6}]^{T}$ represents connection weights and $y$ is the output of the logistic unit. Logistic unit takes $X$ as input and then calculate dot product between $X$, and $W$ (i.e. component-wise multiplication, $1w_{0} + x_{1}w_{1} + ... + x_{6}w_{6}$ or using linear algebra notations $X^{T}W$). The result of $X^{T}W$ is fed to an appropriate non-linear function and produces the output $y$.$

Figure 1: Architecture of a logistic unit. Vector

The logistic unit itself works as a binary classifier and we called it as the logistic regression. For instance, with appropriate weight vectors

Logistic and tanh are two popular non-linear functions which can be used in logistic units. Figure: 2 shows graphs of these two functions for

$$logistic(x) = 1/(1 + e^{-x})$, and $tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ are two commonly used activation functions. The logistic function maps a real-valued input to the output between 0 and 1. For large positive inputs, the output of logistic function reaches 1 and for large negative values it reaches 0. However, values of tanh function are bounded between -1 and +1. Logistic function was a popular choice of non-linear function in early neural network implementations. But, nowadays tanh has become more popular than logistic non-linearity.$

Figure 2:

Single Logistic unit works well with datasets which can be linearly separable. However, it doesn't perform so well with linearly non-separable datasets. This can be easily demonstrated with two synthetically generated datasets shown in Figure: 3.

Fitting logistic regression to two synthetically generated datasets with added Gaussian noise. Left: Dataset shows a linear decision boundary and logistic regression performs well. Right: 2D dataset with non-linear decision boundary, logistic regression performs not so well.

Figure 3: Fitting logistic regression to two synthetically generated datasets with added Gaussian noise. Left: Dataset shows a linear decision boundary and logistic regression performs well. Right: 2D dataset with non-linear decision boundary, logistic regression performs not so well.

So it is clear that logistic regression is capable of classifying linearly separable datasets. Unfortunately, most of the datasets we come across in practical machine learning problems do not show linearly separable property. Hence, we need better classifiers than logistic regression.

Building classifiers with more logistic units would be an obvious and natural approach to overcoming the limitations of a single logistic unit. The classifier which addresses limitations of a single logistic unit is known as the Neural Network. Usually, neural networks arrange logistic units into layers and depending on the orientation of these layers, we have different neural network architectures. Figure: 4 shows a typical neural network with 4 layers. The first layer (denoted by

$Neural network with two hidden layers. $L_1$ represents input we feed into the network. $L_2$ and $L_3$ represent hidden layers and $L_4$ denotes the output of the neural network. $W_1$, $W_2$, and $W_3$ represent input to hidden layer 1, hidden layer 1 to hidden layer 2, and hidden layer 2 to output layer weight matrices respectively. It is to be noted that, we have an extra unit, denoted by $+1$ in each layer (except output layer) and we call it as bias unit. Also, weight vector from bias unit to all units in following layer is usually denoted by $\bf{b}$. So in this figure we have $\bf{b_1}$, $\bf{b_2}$ and $\bf{b_3}$ as bias vectors.$

Figure 4: Neural network with two hidden layers.

In Figure: 4,

Training Procedure

In this section, we introduce key steps of neural network training process. First, we formulate neural network training as an optimization problem. Then, the gradient descent will be introduced as a technique for solving this optimization. Finally, we discuss automatic differentiation as an efficient method for calculating error derivatives of a function w.r.t. its parameters.

Empirical Risk minimization

As we pointed out above, neural network training can be considered as an optimization probable. The framework we use to formulate this optimization problem is known as empirical risk minimization. It is a generic principle and useful in many areas of machine learning. For more details about empirical risk minimization, please read Chapter 4 of [1].

Let me explain empirical risk minimization with an example. Suppose you have developed a neural network for classifying handwritten digits. During training time you input training images (actually, raw pixel intensities) and network predicts most probable digit associated with each image. Suppose you have a function (say

E (θ) = 1 N \sum i L ( f ( X ( i ) ; θ ) , y ( i ) )

In Equation:

In practice, we add an extra term to

E (θ) = 1 N \sum i L ( f ( X ( i ) ; θ ) , y ( i ) ) +

So now we have formulated neural network training as an optimization problem. Therefore, next step would be to find a suitable algorithm which can be used to optimize the empirical loss function given in Equation:

Gradient Descent

In this section, we discuss a simple yet powerful optimization algorithm called gradient descent. Actually, we will be using one of its improved versions for training neural networks. However, having a good understanding on vanilla gradient descent is essential to understand those algorithms. Therefore, in this tutorial we are going to use vanilla gradient descent. But, in upcoming tutorials we will be using few improved versions for training neural networks.

Let's start our discussion with a simple example:

Tangents to the $f(x)=x^2$ at $p_1=(1,1)$ and $p_2=(-2, 4)$. Derivatives of the function $f(x)=x^2$ at $p_1$ and $p_2$ represent slopes of tangent lines to graph at respective points.

Figure 5: Tangents to the

From Figure: 5 it is clear that if you follow the negative direction of the derivative and small step at a time, you will be moving towards the minimum of the

θ (t + 1) = θ (t) - η * \nabla θ f (θ) (3)

Now let's move to the implementation of gradient descent algorithm in Python.

 1 def get_grad(x):
 2     """ This method returns the derivative of f(x)=x^2 function"""
 3     return 2*x
 4 
 5 #initial guess
 6 x = 10
 7 #learning rate
 8 eta = 0.01
 9 
10 num_iterations = 500
11 for i in range(num_iterations):
12     x = x - eta*get_grad(x)
13     if i % 50 == 0:
14         print('Iteration: {:3d} x: {:.3e} f(x): {:.3e}'.format(i, x, x**2))
15 print('Iteration: {:3d} x: {:.3e} f(x): {:.3e}'.format(i, x, x**2))

Program 1: Finding minimum value of

$$f(x,y)=xe^{-(x^2+y^2)}$ in $x=[-1, 1]$, and $y=[-1, 1]$ region$

Figure 7:

Now let's say we are going to find the minimum point of the function

Automatic Differentiation (AD)

In previous sections, we formulated neural network training as an optimization problem. Also, the gradient descent was introduced for obtaining minimum points of the loss function given in Equation:

Discovered independently by several different search groups in the 1970s and 1980s, thebackpropagation algorithm has been using as the main tool for calculating error derivatives of the lost function w.r.t. model parameters. The key idea of the backpropagation algorithm is that error derivatives can be calculated by starting at the output layer of the network and moving towards the input layer. So error derivaties of the

However, in this tutorial instead of backpropagation we will be using a more general technique called reverse-mode automatic differentiation for calculating derivatives of the loss w.r.t. model parameters. Actually, backpropagation is a specialized version of the everse-mode automatic differentiation. Unlike backpropagation, reverse-mode automatic differentiation can be used for calculating derivatives of any computational graphs. Though, it is heavily underused in machine learning, automatic differentiation is a well-established technique used in some other scientific disciplines such as fluid dynamics and nuclear engineering.

Let's consider simple function

forward phase of the computational graph of expression $f(x,y)= e^x + xy$.

Figure 8: forward phase of the computational graph of expression

In Figure: 8, we have introduced three intermediate variables (

Second phase, commonly known as backward pass starts at the bottom of the graph and movies towards inputs. During the backward pass, we calculate derivatives of the output w.r.t. intermediate variables and finally, w.r.t. input variables. Figure: 9 shows the backward pass of

Backward pass of the computational graph of expression $f(x,y)= e^x + xy$.

Figure 9: Backward pass of the computational graph of expression

We start backward pass at the bottom of Figure: 9 and first calculate

Next two step (i.e.

Calculating

Now we have a good understanding of the mechanics of reverse-mode automatic differentiation and we are ready to use it for calculating error derivatives of neural networks. Also, it is worth mentioning that, automatic differentiation is neither fully analytical nor numerical algorithm. At the elementary operations level (such as

Training Shallow Networks

In previous sections, we have discussed a lot of necessary tools for training neural networks. Now it's time to put those tools into practice. But, before moving to full–fledged neural networks, we would like to start with a simple linear network called Softmax Classifier.

$Pictorial representation of a typical softmax classifier. This Softmax Classifier takes input $X \in \mathbb{R}^{4}$ and produces output $y \in \mathbb{R}^{9}$. The connection weights between the input layer (denoted by $L_1$) and the output layer (denoted by $L_2$) are represented by $W \in \mathbb{R}^{4X9}$. Additionally, connection weights between bias unit and the output units are represented by $b \in \mathbb{R}^{9}$. Number of features and distinct output categories decide the architecture (i.e. number of units in input and output layer) of the softmax classifier.$

Figure 10: Pictorial representation of a typical softmax classifier. This Softmax Classifier takes input

Since, we are in the classification setting, it would be very nice to interpret output values of the network as probabilities. However, the pre-activations of the output layer (denoted by

p i = e a i \sum j e a j (4)

Next, in order to use the gradient descent for training our network, we have to devise a suitable cost function which quantifies the discrepancy between predicted and actual classes. In the remaining part of this section, we derive a loss function called cross-entropy loss that will be using in our softmax classifier.

Technically speaking, our output layer calculates conditional probabilities. For instance, if we consider

E (θ) = 1 N \sum n - l o g e ( p k )  data loss: L( θ ) +

Where

So we have discussed a loss function, an efficient technique for calculating error derivatives of the loss function and a optimization algorithm. Hence, now we are ready to train our softmax classifier. Actually, we will be building a handwritten digit recognizer using the softmax classifier.

MNIST Dataset

For building our handwritten digit recognizer, we are going to use MNIST dataset [2]. It is one of the most well-known datasets in the field of machine learning. MNIST dataset consists of 60,000 training and 10,000 testing black and white images of 28x28 pixels. Figure: 11 shows few sample images extracted by MNIST dataset.

Figure 11: Few sample images extracted from popular MNIST dataset. The dataset consists of 60,000 training and 10,000 testing black and white images. All images are 28x28 in pixel, that means a single image contains 784 features.

Implementing Softmax Classifier in Python/Numpy

Program: 2 shows our SoftmaxLayer implementation. It consists of three methods:forward_pass, backward_pass and update_parameters. forward_pass is easy to understand and it first calculates pre-activation using

Figure 12: Computational graph of the softmax classifier.

However, backward_pass is a little bit completed. Therefore, we use the computational flow graph shown in Figure: 12 to understand backward_pass. In order to use gradient descent, we would like to calculate

Consider a training example

\partial L ( θ ) \partial a i = - l o g e e

Considering complete pre-activation vector

\partial L ( θ ) \partial b = \partial a \partial b \partial L (

\partial L ( θ ) \partial w = \partial a \partial w \partial L (

Since, the regulation loss doesn't have bias term,

 1 import numpy as np
 2 
 3 class SoftmaxLayer:
 4     """
 5         SoftmaxLayer class represents teh Softmax layer.
 6         Parameters
 7         ----------
 8         W : matrix W represents the input to output connection weight
 9         b : bias vector
10         reg_parameter : regularization parameter of the L2 regularizer
11     """
12     def __init__(self, W, b, reg_parameter, num_unique_categories):
13         self.W = W
14         self.b = b
15         self.reg_parameter = reg_parameter
16         self.num_unique_categories = num_unique_categories
17 
18     def forward_pass(self, x_input, y_input):
19         """
20         Performs forward pass and returns x_out_prob and total_loss
21         """
22         # calculates pre-activation using XW + b
23         x_hid = np.dot(x_input, self.W) + self.b
24 
25         # subtract np.max(x_hid) from each element of the x_hid
26         # for numerical stability
27         # detials: http://www.iro.umontreal.ca/~bengioy/dlbook/numerical.html
28         x_hid = x_hid - np.max(x_hid)
29         # calculate output probabilities using Equation 4
30         x_out_prob = np.exp(x_hid) / np.sum(np.exp(x_hid), axis=1, keepdims=True)
31 
32         # calculate data loss using -log_e(p_k)
33         num_examples = x_input.shape[0]
34         prob_target =  x_out_prob[range(num_examples), y_input]
35         data_loss_vector = -np.log(prob_target)
36         data_loss = np.sum(data_loss_vector) / num_examples
37 
38         reg_loss = self.reg_parameter * np.sum(self.W * self.W) * 0.5
39         total_loss = data_loss + reg_loss
40 
41         return x_out_prob, total_loss
42 
43     def backward_pass(self, x_out_prob, x_input, y_input):
44         """
45         Performs backward pass and calculates error derivatives.
46         """
47         # calculates error derivaties w.r.t. pre-activation
48         num_examples = y_input.shape[0]
49 
50         # creating one-hot encoding matrix from y_input
51         one_hot = np.zeros((x_input.shape[0], self.num_unique_categories))
52         one_hot[range(num_examples), y_input] = 1
53 
54         # Equation: 7
55         grad_output = x_out_prob - one_hot
56         grad_output = grad_output / num_examples
57 
58         # Equation: 8
59         grad_w = np.dot(x_input.T, grad_output)
60 
61         # adding regularization loss
62         grad_w = grad_w + self.reg_parameter*self.W
63 
64         grad_b = np.sum(grad_output, axis=0, keepdims=True)
65         return grad_w, grad_b
66 
67     def update_parameters(self, W, b):
68         """
69         Updating parameters of the model
70         """
71         self.W = W
72         self.b = b

Program 2: Softmax layer implemented in Python/Numpy
Source: https://github.com/upul/GNN/blob/master/lib/layers.py

Program: 3 shows our handwritten digit classifier. We have used batch Gradient Descent withBATCH_SIZE = 500. After completing 1000 training iterations, we managed to get ~0.904 accuracy using our testing set. Certainly, the accuracy of our handwritten digit recognizer is not that good. In upcoming tutorials, we will be implementing it again using more advanced techniques such as neural networks, deep neural networks and convolution neural networks.

 1 import sys
 2 import numpy as np
 3 
 4 from layers import SoftmaxLayer
 5 from datareader import load_mnist
 6 from constants import *
 7 
 8 x_train, y_train = load_mnist(MNIST_TRAINING_X , MNIST_TRAINING_y)
 9 x_train = x_train.reshape(MNIST_NUM_TRAINING, MNIST_NUM_FEATURES)
10 y_train = y_train.reshape(MNIST_NUM_TRAINING)
11 
12 # initialize parameters randomly
13 W = 0.1 * np.random.randn(MNIST_NUM_FEATURES, MNIST_NUM_OUTPUT)
14 b = np.zeros((1, MNIST_NUM_OUTPUT))
15 
16 learning_rate = 0.1 # step size of the gradient descent algorithm
17 reg_parameter = 0.01  # regularization strength
18 softmax = SoftmaxLayer(W, b, reg_parameter, MNIST_NUM_OUTPUT)
19 
20 num_iter = 1000
21 BATCH_SIZE = 500
22 for i in range(num_iter):
23 
24     idx = np.random.choice(MNIST_NUM_TRAINING, BATCH_SIZE, replace=True)
25     x_batch = x_train[idx, :]
26     y_batch = y_train[idx]
27     output_prob, loss = softmax.forward_pass(x_batch, y_batch)
28     if i % 50 == 0:
29         print('iteration: {:3d} loss: {:3e}'.format(i, loss))
30     gradW, gradB = softmax.backward_pass(output_prob, x_batch, y_batch)
31     W = W - learning_rate*gradW
32     b = b - learning_rate*gradB
33     softmax.update_parameters(W, b)
34 
35 # prediction
36 pred_prob = np.dot(x_train, W) + b
37 predicted_class = np.argmax(pred_prob, axis=1)
38 print('-----------------------------')
39 print('training setaccuracy: {:f}'.format(np.mean(predicted_class == y_train)))
40 print('-----------------------------')
41 print('\n')
42 
43 x_test, y_test = load_mnist(MNIST_TESTING_X, MNIST_TESTING_y)
44 x_test = x_test.reshape(MNIST_NUM_TESTING, MNIST_NUM_FEATURES)
45 y_test = y_test.reshape(MNIST_NUM_TESTING)
46 
47 
48 pred_prob = np.dot(x_test, W) + b
49 predicted_class = np.argmax(pred_prob, axis=1)
50 print('-----------------------------')
51 print('training set accuracy: {:f}'.format(np.mean(predicted_class == y_test)))
52 print('-----------------------------')

Program 3: MINIST handwritten digit classification program using Softmax classifier
Source: https://github.com/upul/GNN/blob/master/chapter3/softmaxclassifier.py

In Program 3: we have used magic numbers for few variables such as learning_rate andreg_parameter. In machine learning, those parameters are known as hyper-parameters and estimating proper values for hyper-parameters is an important step in developing machine learning applications. In upcoming tutorials, we will discuss hyper-parameter estimating techniques in great detail.

Loading Dataset

You can download MNIST dataset from "THE MNIST DATABASE" site [2]. You need four gz files:train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, andt10k-labels-idx1-ubyte.gz. Unpack downloaded files into a convenient location and replace<PATH> in the constants.py file with the location where you have unpacked those four files.

Python codes related to this and upcoming tutorials reside in the GNN GitHub repository [4]. Therefore, please to clone it and add lib folder to your PYTHONPATH environment variable.

 1 MNIST_TRAINING_X = '<PATH>/train-images.idx3-ubyte'
 2 MNIST_TRAINING_y = '<PATH>/dataset_gnn_book/train-labels.idx1-ubyte'
 3 
 4 MNIST_TESTING_X = '<PATH>/t10k-images.idx3-ubyte'
 5 MNIST_TESTING_y = '<PATH>/t10k-labels.idx1-ubyte'
 6 
 7 MNIST_NUM_TRAINING = 60000
 8 MNIST_NUM_TESTING = 10000
 9 MNIST_NUM_FEATURES = 784
10 MNIST_NUM_OUTPUT = 10

Program 2: constants.py contains parameters of our handwritten dataset.
Source: https://github.com/upul/GNN/blob/master/chapter3/constants.py

Conclusion

In this tutorial, we mainly focused on setting up the background. We started our discussion with the logistic regression and then moved to Neural Networks. Next, we introduced a powerful framework that we are going to use for training neural networks called "Empirical Risk Minimization". "Gradient Descent" was introduced as an algorithm for minimizing empirical risk. In order to work with gradient descent, we need derivatives of the lost function w.r.t. parameters of our neural network model (i.e.

In the next tutorial, we are going to extend our shallow network into a full-blown neural network. Also, we will discuss a bunch of new machine learning concepts and techniques in the next tutorial.

References

[1]. Machine learning: a probabilistic perspective (adaptive computation and machine learning series) KP Murphy - MIT Press. 2012

[2]. http://yann.lecun.com/exdb/mnist/

[3]. https://github.com/upul/GNN

posted @ 2015-10-14 19:25 菜鸡一枚阅读(629) 评论(0) 收藏举报

刷新页面返回顶部

菜鸡一枚

Training (deep) Neural Networks Part: 1

Training (deep) Neural Networks Part: 1

Introduction to Neural Network

Training Procedure

Empirical Risk minimization

Gradient Descent

Automatic Differentiation (AD)

Training Shallow Networks

MNIST Dataset

Implementing Softmax Classifier in Python/Numpy

Loading Dataset

Conclusion

References

公告