Deep Learning Flower Book 1
These notes were originally kept on my personal blog; I have now decided to migrate them to 博客园, so that the personal blog can hold more of my original content.
Linear Algebra
Overview
I just make notes about the concepts of Linear Algebra that are unfamiliar to me.
Tensor
Sometimes we need to discuss arrays with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.
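As a quick concrete illustration (my own toy example, not from the book), a NumPy array with three axes is such a tensor:

```python
import numpy as np

# A tensor with three axes: 2 "pages", each a 3x4 grid of numbers.
# Element A[i, j, k] is indexed by one coordinate per axis.
A = np.arange(24).reshape(2, 3, 4)

print(A.ndim)      # 3 axes
print(A.shape)     # (2, 3, 4)
print(A[1, 2, 3])  # a single element, indexed by three coordinates
```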
Norms
To measure the size of a vector in machine learning, we use a family of functions called norms.
Formally, the $L^p$ norm is given by $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \in \mathbb{R},\ p \ge 1$.
Norms are functions mapping vectors to non-negative values. More rigorously, a norm is any function f that satisfies the following properties:
- $f(x) = 0 \Rightarrow x = 0$
- $f(x + y) \le f(x) + f(y)$ (the triangle inequality)
- $f(\alpha x) = |\alpha|\, f(x)$ for all $\alpha \in \mathbb{R}$
The norm with p = 2 is known as the Euclidean norm, which is simply the Euclidean distance from the origin to the point identified by the vector x.
It is used so frequently in machine learning that the subscript 2 is often omitted, writing it simply as $\|x\|$.
In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases we turn to the $L^1$ norm, the norm with p = 1: $\|x\|_1 = \sum_i |x_i|$.
Sometimes we also need the max norm, $\|x\|_\infty = \max_i |x_i|$, which simplifies to the absolute value of the element with the largest magnitude in the vector.
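A small NumPy sketch (my own example values) computing the three vector norms just discussed, checked against `np.linalg.norm`:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.5])

l2   = np.sqrt(np.sum(x ** 2))   # Euclidean (L^2) norm
l1   = np.sum(np.abs(x))         # L^1 norm: sum of absolute values
lmax = np.max(np.abs(x))         # max (L^infinity) norm: largest magnitude

# np.linalg.norm computes the same quantities via its `ord` argument.
assert np.isclose(l2,   np.linalg.norm(x))               # default is ord=2
assert np.isclose(l1,   np.linalg.norm(x, ord=1))
assert np.isclose(lmax, np.linalg.norm(x, ord=np.inf))
```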
In some cases, we may also wish to measure the size of a matrix. The most common way to do this is with the otherwise obscure Frobenius norm, $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$, which is analogous to the $L^2$ norm of a vector.
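The same idea in code: a short sketch (again with my own example values) showing that the Frobenius norm of a matrix is just the $L^2$ norm of its flattened entries:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Frobenius norm: square root of the sum of squares of all entries.
frob = np.sqrt(np.sum(A ** 2))

assert np.isclose(frob, np.linalg.norm(A))          # default for matrices is 'fro'
assert np.isclose(frob, np.linalg.norm(A.ravel()))  # L^2 norm of the flattened matrix
```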
Singular value decomposition (SVD)
The SVD provides a way to factorize a matrix into singular vectors and singular values, and it is more generally applicable than eigendecomposition: every real matrix has an SVD, but not every matrix has an eigendecomposition.
Suppose that A is an m × n matrix. Then we can write $A = UDV^{\top}$, where U is an m × m matrix, D is an m × n matrix, and V is an n × n matrix. D is a diagonal matrix (not necessarily square), and both U and V are orthogonal matrices.
The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors, and the columns of V are known as the right-singular vectors.
We can interpret the singular value decomposition of A in terms of the eigendecomposition of functions of A. The left-singular vectors of A are the eigenvectors of $AA^{\top}$, the right-singular vectors are the eigenvectors of $A^{\top}A$, and the nonzero singular values are the square roots of the eigenvalues of $A^{\top}A$.
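A short NumPy sketch of these facts, using a random matrix of my own choosing; `np.linalg.svd` returns the singular values as a vector, so the sketch rebuilds the m × n diagonal matrix D explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # a non-square matrix

U, s, Vt = np.linalg.svd(A)              # s holds the singular values
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)         # embed them in an m x n diagonal matrix

assert np.allclose(A, U @ D @ Vt)        # A = U D V^T

# The squared singular values are the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted descending
assert np.allclose(eigvals, s ** 2)
```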
The Moore-Penrose Pseudoinverse
The pseudoinverse of A is defined as the matrix $A^{+} = \lim_{\alpha \searrow 0} \left(A^{\top}A + \alpha I\right)^{-1} A^{\top}$.
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula $A^{+} = VD^{+}U^{\top}$, where U, D and V come from the singular value decomposition of A, and the pseudoinverse $D^{+}$ of the diagonal matrix D is obtained by taking the reciprocal of its nonzero elements and then transposing the resulting matrix.
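A small NumPy sketch of this recipe (random full-rank matrix of my own choosing, so all singular values are nonzero), cross-checked against `np.linalg.pinv`:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # full column rank almost surely

U, s, Vt = np.linalg.svd(A)

# D^+ : reciprocal of the nonzero singular values, then transpose,
# so D^+ is n x m when D is m x n.
D_plus = np.zeros((A.shape[1], A.shape[0]))
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)

A_plus = Vt.T @ D_plus @ U.T             # A^+ = V D^+ U^T

assert np.allclose(A_plus, np.linalg.pinv(A))
```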
When A has more columns than rows, solving a linear equation $Ax = y$ using the pseudoinverse, $x = A^{+}y$, provides one of the many possible solutions. Specifically, it provides the solution with the minimal Euclidean norm $\|x\|_2$ among all possible solutions.
When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of the Euclidean norm $\|Ax - y\|_2$.
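A small sketch of both cases with random matrices of my own choosing; the tall case is also cross-checked against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(2)

# Wide A (more columns than rows): infinitely many solutions;
# the pseudoinverse picks the one with minimal ||x||_2.
A = rng.standard_normal((2, 4))
y = rng.standard_normal(2)
x = np.linalg.pinv(A) @ y
assert np.allclose(A @ x, y)             # an exact solution

# Tall A (more rows than columns): usually no exact solution;
# the pseudoinverse gives the least-squares x minimizing ||Ax - y||_2.
A = rng.standard_normal((4, 2))
y = rng.standard_normal(4)
x = np.linalg.pinv(A) @ y
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x, x_lstsq)
```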
The trace operator
The trace operator gives the sum of the diagonal entries of a matrix: $\operatorname{Tr}(A) = \sum_i A_{i,i}$. The trace of a square matrix composed of many factors is invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined: $\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB) = \operatorname{Tr}(BCA)$.
This invariance to cyclic permutation holds even if the resulting product has a different shape.
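A quick numerical check of the cyclic property with factors of different shapes (shapes and values are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

# ABC is 2x2, CAB is 4x4, BCA is 3x3 -- different shapes, same trace.
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)

assert np.isclose(t1, t2) and np.isclose(t2, t3)
```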
Principal components analysis (PCA)
Suppose we have a collection of m points $\{x^{(1)}, \dots, x^{(m)}\}$ in $\mathbb{R}^n$ and we want to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We want the loss of precision to be as small as possible.
One way to do this is to encode each point $x \in \mathbb{R}^n$ as a lower-dimensional code vector $c \in \mathbb{R}^l$ with $l < n$, using an encoding function $f(x) = c$ and a decoding function $g(c) \approx x$. PCA uses matrix multiplication as the decoder, $g(c) = Dc$ with $D \in \mathbb{R}^{n \times l}$, and we constrain the columns of D to be orthogonal to each other. To give the problem a unique solution, we also constrain all the columns of D to have unit norm.
We can use the squared Euclidean norm to measure how well the decoder reconstructs a point, choosing the optimal code as $c^* = \arg\min_c \|x - g(c)\|_2^2$.
After some mathematical derivation (expanding the squared norm and setting the gradient with respect to c to zero), the problem reduces to a simple encoder: $c = D^{\top}x$, i.e. $f(x) = D^{\top}x$.
So we can define the PCA reconstruction operation: $r(x) = g(f(x)) = DD^{\top}x$.
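As a small sketch of what the encoder and decoder do to a single point (the matrix D and the vector x below are my own toy choices, picked only to make the shapes and the information loss visible):

```python
import numpy as np

# An n x l matrix with orthonormal columns (here n=3, l=2:
# the first two standard basis vectors).
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

x = np.array([2.0, -1.0, 5.0])

c = D.T @ x    # encode: f(x) = D^T x, a code in R^l
r = D @ c      # decode / reconstruct: r(x) = D D^T x
print(c)       # [ 2. -1.]
print(r)       # [ 2. -1.  0.]  -- the third coordinate is lost
```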
To choose D, we can no longer consider the points in isolation. We must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points: $D^* = \arg\min_D \sqrt{\sum_{i,j} \left(x^{(i)}_j - r(x^{(i)})_j\right)^2}$ subject to $D^{\top}D = I_l$.
To simplify the problem, we start by considering the case l = 1, so that D reduces to a single unit vector d.
We can rewrite the problem as $d^* = \arg\min_d \sum_i \left\|x^{(i)} - dd^{\top}x^{(i)}\right\|_2^2$ subject to $\|d\|_2 = 1$.
Finally, stacking the points as the rows of a matrix $X \in \mathbb{R}^{m \times n}$, with $X_{i,:} = x^{(i)\top}$, we can restate the problem in terms of a single matrix: $d^* = \arg\min_d \left\|X - Xdd^{\top}\right\|_F^2$ subject to $d^{\top}d = 1$.
Expanding the Frobenius norm and using the cyclic property of the trace, this is equivalent to $d^* = \arg\max_d \operatorname{Tr}\!\left(d^{\top}X^{\top}Xd\right) = \arg\max_d d^{\top}X^{\top}Xd$ subject to $d^{\top}d = 1$.
This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of $X^{\top}X$ corresponding to the largest eigenvalue.
This derivation is specific to the case of l = 1 and recovers only the first principal component. More generally, when we wish to recover a basis of principal components, the matrix D is given by the l eigenvectors of $X^{\top}X$ corresponding to the largest eigenvalues.
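Putting the whole recipe together, here is a minimal NumPy sketch under my own assumptions (random synthetic data with one point per row of X, no mean-centering, l = 2); it takes D to be the top-l eigenvectors of $X^{\top}X$ and cross-checks them against the right-singular vectors of X:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, l = 200, 5, 2
# Correlated synthetic data, one point per row.
# (In practice the data is usually mean-centered first; skipped here
# to follow the derivation literally.)
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))

# Principal directions: the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending order
D = eigvecs[:, ::-1][:, :l]                  # keep the top-l eigenvectors

C = X @ D              # codes: c^(i) = D^T x^(i), stacked as rows
X_rec = C @ D.T        # reconstructions: r(x^(i)) = D D^T x^(i)

err = np.linalg.norm(X - X_rec)   # Frobenius norm of the error matrix
print(err)

# Sanity check: the same directions come out of the SVD of X
# (the top right-singular vectors), up to sign.
_, _, Vt = np.linalg.svd(X)
assert np.allclose(np.abs(D.T @ Vt[:l].T), np.eye(l), atol=1e-6)
```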