Deep Learning Notes: Chapter 4 Numerical Computation

1. Overflow and Underflow

  • Underflow occurs when numbers near zero are rounded to zero
  • Overflow occurs when numbers with large magnitude are approximated as \(\infty\) or \(-\infty\)

\(\Large\text{Note:}\) One example of a function that must be stabilized against underflow and overflow is the \(\textbf{softmax}\) function. The \(\textbf{softmax}\) function is often used to predict the probabilities associated with a multinoulli distribution:

\[\text{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)} \]
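A standard fix is to evaluate \(\text{softmax}(\mathbf{z})\) with \(\mathbf{z} = \mathbf{x} - \max_i x_i\): the result is unchanged, the largest exponent becomes \(0\) (no overflow), and at least one denominator term equals \(1\) (no division by an underflowed zero). A minimal NumPy sketch (the function name `softmax` and the test vector are illustrative, not from the book):

```python
import numpy as np

def softmax(x):
    # softmax(x) == softmax(x + c) for any constant c, so shift by -max(x):
    # the largest argument to exp is 0 (no overflow) and at least one
    # denominator term is exp(0) = 1 (no zero denominator).
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # naive exp(1000.0) would overflow
```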

2. Poor Conditioning

Consider the function \(f(x) = A^{-1}x\). When \(A\in\mathbb{R}^{n\times n}\) has an eigenvalue decomposition, its \(\textbf{condition number}\) is:

\[\max_{i,j}\left |\frac{\lambda_i}{\lambda_j} \right| \]

This is the ratio of the magnitudes of the largest and smallest eigenvalues. When this number is large, matrix inversion is particularly sensitive to error in the input.
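As a quick illustration, the definition above can be computed directly from the eigenvalues. A minimal NumPy sketch (the helper `condition_number` and the example matrix `A` are arbitrary choices for illustration):

```python
import numpy as np

def condition_number(A):
    # Ratio of the largest to smallest eigenvalue magnitude,
    # assuming A is square with nonzero eigenvalues.
    lam = np.abs(np.linalg.eigvals(A))
    return lam.max() / lam.min()

A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])          # nearly singular, hence poorly conditioned
print(condition_number(A))            # ~1e6: inverting A amplifies input error
```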

3. Jacobian and Hessian Matrices

Sometimes we need to find all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a \(\textbf{Jacobian matrix}\). Specifically, if we have a function \(f:\mathbb{R}^m\rightarrow \mathbb{R}^n\), then the Jacobian matrix \(\mathbf{J}\in\mathbb{R}^{n\times m}\) of \(f\) is defined as:

\[J_{i,j} = \frac{\partial f(x)_i}{\partial x_j} \]
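For concreteness, the Jacobian can be approximated column by column with forward differences. A rough NumPy sketch (`jacobian_fd`, the step size `eps`, and the example map are illustrative assumptions):

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    # J[i, j] ~= (f(x + eps * e_j)_i - f(x)_i) / eps  (forward differences)
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - fx) / eps
    return J

# Example f : R^2 -> R^2, so J is 2 x 2
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian_fd(f, np.array([1.0, 2.0])))  # close to [[2, 1], [cos(1), 0]]
```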

The \(\textbf{Hessian}\) matrix \(\mathbf{H}(f)(x)\) is defined as:

\[\mathbf{H}(f)(x)_{i,j} = \frac{\partial^2 f(x)}{\partial x_i\partial x_j} \]

Equivalently, the \(\textbf{Hessian}\) is the Jacobian of the gradient.

\(\Large\text{Note:}\) wherever the second partial derivatives are continuous, the differential operators are commutative:

\[\frac{\partial^2 f(x)}{\partial x_i\partial x_j} = \frac{\partial^2 f(x)}{\partial x_j\partial x_i} \]

This implies that \(H_{i,j} = H_{j,i}\) \(\Rightarrow \textbf{Symmetric}\).

Because the Hessian matrix is real and symmetric, we can decompose it into a set of real \(\textbf{eigenvalues}\) and an \(\textbf{orthogonal basis}\) of eigenvectors.
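The sketch below ties these points together: it builds the Hessian of a small scalar function with central differences (the helper `hessian_fd`, the step `eps`, and the example function are illustrative assumptions), checks the symmetry noted above, and uses `np.linalg.eigh` to obtain the real eigenvalues and orthonormal eigenvectors:

```python
import numpy as np

def hessian_fd(f, x, eps=1e-5):
    # H[i, j] ~= d^2 f / (dx_i dx_j) via central differences.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += eps; xpp[j] += eps
            xpm = x.copy(); xpm[i] += eps; xpm[j] -= eps
            xmp = x.copy(); xmp[i] -= eps; xmp[j] += eps
            xmm = x.copy(); xmm[i] -= eps; xmm[j] -= eps
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * eps**2)
    return H

f = lambda x: x[0]**2 * x[1] + 3 * x[1]**2     # scalar function of two variables
H = hessian_fd(f, np.array([1.0, 2.0]))         # exact Hessian here is [[4, 2], [2, 6]]
print(np.allclose(H, H.T, atol=1e-3))           # True: H is (numerically) symmetric
w, Q = np.linalg.eigh(H)                        # real eigenvalues, orthonormal eigenvectors
print(w)
```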

4. Constrained Optimization

The Karush–Kuhn–Tucker (KKT) approach provides a very general solution to constrained optimization. It introduces the generalized \(\textbf{Lagrangian}\):

\[L(x,\lambda,\alpha) = f(x)+\sum_i\lambda_ig^{(i)}(x)+\sum_j\alpha_jh^{(j)}(x) \]

where \(f(x)\) is the objective function, \(g^{(i)}(x)=0, \forall i\) are the \(\textbf{equality constraints}\), and \(h^{(j)}(x)\leq 0, \forall j\) are the \(\textbf{inequality constraints}\).

As long as at least one feasible point exists and \(f(x)\) is not permitted to take the value \(\infty\), then

\[\min_x\max_{\lambda}\max_{\alpha\ge 0}L(x,\lambda,\alpha) \]

has the same optimal objective function value and set of optimal points \(x\) as

\[\min_x f(x) \]
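In practice the KKT conditions are usually enforced by a solver rather than by constructing the Lagrangian by hand. A minimal sketch, assuming SciPy is available (its SLSQP method handles the equality and inequality constraints internally; note that SciPy's `ineq` convention is \(\text{fun}(x) \geq 0\), the opposite sign of \(h(x) \leq 0\)). The example objective and constraints are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# min_x f(x) = x0^2 + x1^2
# s.t. g(x) = x0 + x1 - 1 = 0   (equality constraint)
#      h(x) = x0 - 0.8 <= 0     (inequality constraint)
f = lambda x: x[0]**2 + x[1]**2

constraints = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1},  # g(x) = 0
    {"type": "ineq", "fun": lambda x: 0.8 - x[0]},       # SciPy wants fun(x) >= 0, i.e. -h(x) >= 0
]

res = minimize(f, x0=np.array([0.0, 0.0]), method="SLSQP", constraints=constraints)
print(res.x)  # roughly [0.5, 0.5]
```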
