Deep Learning Flower Book 3

Numerical Computation


Overview


Because I had not studied this chapter before, I am simply doing my best to record my view of it, and my notes may be incomplete.

Overflow and Underflow


  • Underflow occurs when numbers near zero are rounded to zero.
  • Overflow occurs when numbers with large magnitude are approximated as infinity or negative infinity.

One example of a function that must be stabilized against underflow and overflow is the softmax function, which is used to predict the probabilities associated with a multinoulli distribution.

We can stabilize it by evaluating softmax(z) instead, where z = x − max_i x_i, i.e., the vector x with its maximum component subtracted from every component.

However, underflow in the numerator can still cause the expression as a whole to evaluate to zero, which becomes a problem if we then take the logarithm of the result (log softmax needs its own stabilization).
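To make this concrete, here is a minimal NumPy sketch of the stabilized softmax (the function name and the test values are my own, not from the book):

    import numpy as np

    def stable_softmax(x):
        # Shift so the largest exponent is exp(0) = 1: this rules out
        # overflow, and the denominator is at least 1, so it cannot
        # underflow to zero and cause a division by zero.
        z = x - np.max(x)
        exp_z = np.exp(z)
        return exp_z / np.sum(exp_z)

    # A naive softmax would overflow on these inputs; the shifted version is fine.
    print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))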

Poor Conditioning


Conditioning refers to how rapidly a function changes with respect to small changes in its inputs.

Consider the function f(x) = A⁻¹x.

When A has an eigenvalue decomposition, its condition number is max_{i,j} |λ_i / λ_j|, i.e., the ratio of the magnitude of the largest eigenvalue to that of the smallest.
When this number is large, matrix inversion is particularly sensitive to error in input.
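As a small illustration (the example matrices are my own), the condition number can be computed directly from the eigenvalues:

    import numpy as np

    def eigen_condition_number(A):
        # Ratio of the largest to the smallest eigenvalue magnitude.
        eigvals = np.linalg.eigvals(A)
        return np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))

    well_conditioned = np.array([[2.0, 0.0], [0.0, 1.0]])      # ratio 2
    poorly_conditioned = np.array([[1.0, 0.0], [0.0, 1e-6]])   # ratio 1e6

    print(eigen_condition_number(well_conditioned))
    print(eigen_condition_number(poorly_conditioned))

Inverting the poorly conditioned matrix amplifies any error along its small-eigenvalue direction by a factor of about 10^6.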

Gradient-Based Optimization


The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, the loss function, or the error function.

We denote the value that minimizes or maximizes a function with a superscript *.

Types of critical points (points where the slope is zero):

  • Maximum
  • Minimum
  • Saddle Point

Approximate minimization: optimization algorithms may fail to find a global minimum when multiple local minima or plateaus are present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, as long as they correspond to significantly low values of the cost function.

Now consider a function f with an n-dimensional input. We can decrease f by moving in the direction of the negative gradient, x' = x − ε∇_x f(x). This is the method of steepest descent, or gradient descent.

Here ε is a positive scalar determining the size of the step, known as the learning rate. One way to set it is line search: evaluate f(x − ε∇_x f(x)) for several values of ε and choose the one that gives the smallest objective value.
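A minimal sketch of gradient descent with a fixed learning rate (the function and the example objective below are my own, not the book's notation):

    import numpy as np

    def gradient_descent(grad_f, x0, learning_rate=0.1, n_steps=100):
        # Repeatedly step against the gradient: x <- x - epsilon * grad_f(x).
        x = np.asarray(x0, dtype=float)
        for _ in range(n_steps):
            x = x - learning_rate * grad_f(x)
        return x

    # Minimize f(x) = x1^2 + 4 * x2^2, whose gradient is (2*x1, 8*x2).
    grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
    print(gradient_descent(grad, [3.0, -2.0]))   # approaches (0, 0)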

Beyond the gradient: Jacobian and Hessian Matrices


When we face the situation where both the input and the output are vectors, the matrix containing all of the partial derivatives is known as the Jacobian matrix.

When we also consider second derivatives, the analogous object is the Hessian matrix.

In most situations we meet in machine learning, the Hessian matrix is symmetric, so it has a real eigenvalue decomposition. The second derivative in a specific unit direction d is dᵀHd, which is a weighted average of the eigenvalues: the smaller the angle between d and an eigenvector, the more weight the corresponding eigenvalue gets.
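A quick numerical check of this idea (the Hessian below is my own toy example): along an eigenvector, dᵀHd equals exactly the corresponding eigenvalue.

    import numpy as np

    H = np.array([[3.0, 2.0],
                  [2.0, 3.0]])            # symmetric, eigenvalues 1 and 5
    eigvals, eigvecs = np.linalg.eigh(H)  # eigh handles symmetric matrices

    # Along each eigenvector, d^T H d reproduces the eigenvalue.
    for lam, v in zip(eigvals, eigvecs.T):
        print(lam, v @ H @ v)

    # For any other unit direction, d^T H d lies between the smallest
    # and largest eigenvalue, weighted by how closely d aligns with
    # each eigenvector.
    d = np.array([1.0, 0.0])
    print(d @ H @ d)                      # 3.0, between 1 and 5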

Analysing a gradient descent step with a second-order Taylor series approximation, f(x − εg) ≈ f(x) − εgᵀg + (1/2)ε²gᵀHg, the term that accounts for the curvature of the function can influence the result: when it is too large, the step can actually move uphill, and when gᵀHg is zero or negative the approximation predicts that ever-larger steps keep decreasing f, so in that case we must resort to more heuristic choices of the step size.

When gᵀHg is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields ε* = gᵀg / (gᵀHg). In the worst case, when g aligns with the eigenvector of H with the largest eigenvalue λ_max, the optimal step size shrinks to 1/λ_max.

To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
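As a quick check of this formula (the example values are my own): when the gradient happens to align with the eigenvector belonging to the largest eigenvalue, the optimal step is exactly 1/λ_max.

    import numpy as np

    def optimal_step_size(g, H):
        # Step that minimizes the second-order Taylor approximation
        # along the negative gradient: eps* = (g^T g) / (g^T H g).
        return (g @ g) / (g @ H @ g)

    H = np.diag([1.0, 100.0])       # eigenvalues 1 and 100
    g = np.array([0.0, 1.0])        # gradient along the lambda = 100 eigenvector

    print(optimal_step_size(g, H))  # 0.01 = 1 / lambda_max, the worst case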

In multiple dimensions, the condition number of the Hessian at this point measures how much the second derivatives differ from each other. When the Hessian has a poor condition number, gradient descent performs poorly.

  • In one direction the derivative increases rapidly, while in another direction it increases slowly. Gradient descent is unaware of this difference, so it does not know that it should explore preferentially in the direction where the derivative remains negative for longer.
  • Poor conditioning also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature, but such a small step makes progress slow in directions with little curvature.

This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method for doing so is Newton's method, which is based on a second-order Taylor series expansion to approximate f(x) near some point x^(0); solving for the critical point of this approximation gives the update x* = x^(0) − H⁻¹∇_x f(x^(0)), where H is the Hessian of f evaluated at x^(0).

Iteratively updating the approximation and jumping to its minimum can reach the critical point much faster than gradient descent. This is useful near a local minimum, but it can be harmful near a saddle point, since Newton's method may jump straight to the saddle point; gradient descent, by contrast, is not attracted to saddle points unless the gradient points toward them.
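Here is a minimal sketch of a single Newton step (my own example; for a quadratic function the Hessian is constant and one step lands exactly on the minimum):

    import numpy as np

    def newton_step(x, grad, hess):
        # x_new = x - H(x)^{-1} grad(x); solve the linear system rather
        # than forming the inverse explicitly.
        return x - np.linalg.solve(hess(x), grad(x))

    # Quadratic f(x) = 0.5 * x^T A x - b^T x, with constant Hessian A.
    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    grad = lambda x: A @ x - b
    hess = lambda x: A

    x0 = np.array([10.0, -10.0])
    x1 = newton_step(x0, grad, hess)
    print(x1, np.allclose(grad(x1), 0.0))   # one step reaches the exact minimum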

Constrained Optimization


The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. We apply the generalized Lagrangian (also called the generalized Lagrange function) to find a solution.

The functions g^(i)(x) define the equality constraints (g^(i)(x) = 0) and the functions h^(j)(x) define the inequality constraints (h^(j)(x) ≤ 0).

Optimizing the generalized Lagrangian (minimizing over x while maximizing over the multipliers) has the same optimal objective function value and the same set of optimal points as the original constrained problem.
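For reference, this is how I would write the generalized Lagrangian and the equivalent unconstrained problem in LaTeX notation (λ are the equality multipliers, α the inequality multipliers; this is my own transcription of the standard formulation):

    L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i \, g^{(i)}(x) + \sum_j \alpha_j \, h^{(j)}(x)

    \min_x \; \max_{\lambda} \; \max_{\alpha \ge 0} \; L(x, \lambda, \alpha)

The KKT conditions, which are necessary (but not always sufficient) for a point to be optimal, are: the gradient of L with respect to x is zero, all constraints on x and all constraints α ≥ 0 are satisfied, and α_j · h^(j)(x) = 0 for every inequality constraint (complementary slackness).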

Example: Linear Least Squares


An algorithm to minimize f(x) = (1/2)‖Ax − b‖² using gradient descent, starting from an arbitrary value of x:

While the norm of the gradient, ∇_x f(x) = Aᵀ(Ax − b), is larger than some tolerance δ, take a step of size ε in the direction of the negative gradient: x ← x − ε Aᵀ(Ax − b).
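A direct NumPy transcription of this procedure (the variable names, the example system, and the particular step size and tolerance are my own choices):

    import numpy as np

    def least_squares_gradient_descent(A, b, step_size=0.01, tolerance=1e-6):
        # Minimize f(x) = 0.5 * ||A x - b||^2 by gradient descent.
        # The gradient is grad_f(x) = A^T (A x - b).
        x = np.zeros(A.shape[1])
        grad = A.T @ (A @ x - b)
        while np.linalg.norm(grad) > tolerance:
            x = x - step_size * grad
            grad = A.T @ (A @ x - b)
        return x

    A = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])
    x = least_squares_gradient_descent(A, b)
    print(x, np.allclose(A.T @ (A @ x - b), 0.0, atol=1e-5))

(For a small problem like this the same answer could be obtained directly with np.linalg.lstsq; the loop just mirrors the gradient descent algorithm described above.)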

posted @ 2019-12-07 14:51  IdiotNe