Machine Learning Study Notes: PRML Chapter 2.0 : Prerequisite 2 -Singular Value Decomposition (SVD)

Chapter 2.0 : Prerequisite 2 -Singular Value Decomposition (SVD)

Christopher M. Bishop, PRML, Chapter 2 Probability Distributions

1. Vector Terminology

Orthogonality
Two vectors $x$ and $y$ are said to be orthogonal to each other if their inner product equals zero, i.e., $x^Ty = 0$.
Normal Vector
A normal vector (or unit vector) is a vector of length 1, i.e., $\|x\| = \sqrt{x^Tx} = 1$.
Orthonormal Vectors
Vectors of unit length that are orthogonal to each other are said to be orthonormal.

2. Matrix Terminology

2.1 Orthogonal Matrix

A square matrix $A$ is orthogonal if

$$AA^T = A^TA = I$$

where $I$ is the identity matrix.

2.2 Eigenvectors and Eigenvalues

An eigenvector is a nonzero vector $x$ that satisfies the equation

$$Ax = \lambda x$$

where $A$ is a square matrix, the scalar $\lambda$ is an eigenvalue, and $x$ is the eigenvector.

Eigenvalues and eigenvectors are also known as, respectively, characteristic roots and characteristic vectors, or latent roots and latent vectors.

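THE KEY IDEAS [see Ref-7]:

$Ax = \lambda x$ says that eigenvectors $x$ keep the same direction when multiplied by $A$.
$Ax = \lambda x$ also says that $\det(A - \lambda I) = 0$. This determines the $n$ eigenvalues.
The eigenvalues of $A^2$ and $A^{-1}$ are $\lambda^2$ and $\lambda^{-1}$, respectively, with the same eigenvectors.
The sum of the $\lambda$'s equals the sum down the main diagonal of $A$ (the trace), i.e., $\sum_{i=1}^n \lambda_i = \operatorname{tr}(A)$.
The product of the $\lambda$'s equals the determinant, i.e., $\prod_{i=1}^n \lambda_i = \det(A)$.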
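The key ideas listed here can be checked numerically. The following is a minimal NumPy sketch; the matrix `A` is a hypothetical example chosen only for illustration and does not come from Ref-7.

```python
import numpy as np

# Hypothetical symmetric 2x2 matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of eigvecs are the eigenvectors; eigvals holds the eigenvalues.
eigvals, eigvecs = np.linalg.eig(A)

# A x = lambda x holds for every eigenpair.
pairs_ok = all(np.allclose(A @ x, lam * x) for lam, x in zip(eigvals, eigvecs.T))

# Sum of eigenvalues equals the trace; product equals the determinant.
trace_matches = np.isclose(eigvals.sum(), np.trace(A))
det_matches = np.isclose(eigvals.prod(), np.linalg.det(A))

# A^2 and A^{-1} keep the eigenvectors, with eigenvalues lambda^2 and 1/lambda.
x0, lam0 = eigvecs[:, 0], eigvals[0]
square_matches = np.allclose(A @ A @ x0, lam0**2 * x0)
inverse_matches = np.allclose(np.linalg.inv(A) @ x0, x0 / lam0)

print(pairs_ok, trace_matches, det_matches, square_matches, inverse_matches)
```

For this particular matrix the eigenvalues are 3 and 1, so the trace is 4 and the determinant is 3.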

2.3 Understanding eigenvectors and eigenvalues in terms of transformation and the corresponding matrix [see Ref-9]

In linear algebra, an eigenvector or characteristic vector of a linear transformation $T$ from a vector space $V$ over a field $F$ into itself is a non-zero vector that does not change its direction when that linear transformation is applied to it. In other words, if $v$ is a vector that is not the zero vector, then it is an eigenvector of a linear transformation $T$ if $T(v)$ is a scalar multiple of $v$. This condition can be written as the mapping

$$T(v) = \lambda v$$

where $\lambda$ is a scalar in the field $F$, known as the eigenvalue or characteristic value associated with the eigenvector $v$.

If the vector space $V$ is finite-dimensional, then the linear transformation $T$ can be represented as a square matrix $A$, and the vector $v$ by a column vector, rendering the above mapping as a matrix multiplication on the left hand side and a scaling of the column vector on the right hand side in the equation

$$Av = \lambda v$$

There is a correspondence between $n$ by $n$ square matrices and linear transformations from an n-dimensional vector space to itself. For this reason, it is equivalent to define eigenvalues and eigenvectors using either the language of matrices or the language of linear transformations.

Geometrically, an eigenvector corresponding to a real, nonzero eigenvalue points in a direction that is stretched by the transformation and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed.

This is shown in the following figure, where the matrix $A$ acts by stretching the vector $x$, not changing its direction, so $x$ is an eigenvector of $A$.

[Figure: a matrix $A$ stretching a vector $x$ without changing its direction]

3. Singular Value Decomposition

3.1 Understanding of SVD

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view.

1) a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items.
2) a method for identifying and ordering the dimensions along which data points exhibit the most variation.
3) a method for data reduction, since once we have identified where the most variation is, it’s possible to find the best approximation of the original data points using fewer dimensions.

3.2 Statement of the SVD Theorem

SVD is based on a theorem from linear algebra which says that a rectangular matrix $A$ can be broken down into the product of three matrices:

an orthogonal matrix $U$ (i.e., $U^TU = I$);
a diagonal matrix $S$;
the transpose of an orthogonal matrix $V$ (i.e., $V^TV = I$).

The theorem is usually presented something like this:

$$A_{m \times n} = U_{m \times m}\, S_{m \times n}\, V^T_{n \times n}$$

[Figure: block diagrams of the factorization for the two cases $m \ge n$ and $n > m$; see Ref-4]

The columns of $U$ and the columns of $V$ are called the left-singular vectors and right-singular vectors of $A$, respectively.
The columns of $U$ are orthonormal eigenvectors of $AA^T$.
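There is a brief proof. Let $U = [u_1, u_2, \dots, u_m]$, where the column vector $u_i \in \mathbb{R}^m$, for $i = 1, \dots, m$, with $u_i^Tu_j = \delta_{ij}$. The claim is

$$(AA^T)\,U = U\,(SS^T). \tag{3.1}$$

Firstly, to calculate the product $AA^T$:

$$AA^T = (USV^T)(USV^T)^T = US(V^TV)S^TU^T = U(SS^T)U^T. \tag{3.2}$$

The LHS of (3.1) equals:

$$(AA^T)\,[u_1, \dots, u_m] = [AA^Tu_1, \dots, AA^Tu_m]. \tag{3.3}$$

Substitute (3.2) into (3.1) to generate the RHS of (3.1):

$$\begin{aligned}(AA^T)\,U &= U(SS^T)U^TU = U(SS^T) \\ &= [\sigma_1^2u_1, \dots, \sigma_m^2u_m],\end{aligned} \tag{3.4}$$

where $\sigma_i^2$ denotes the $i$-th diagonal entry of $SS^T$ (zero for indices beyond the number of singular values). You can verify the second line of (3.4) by listing all the elements of the column vectors and multiplying the matrices out according to the matrix product rule. Therefore (3.3) and (3.4) give us the following equation

$$AA^Tu_i = \sigma_i^2u_i, \quad i = 1, \dots, m.$$

Similarly, we can prove that the columns of $V$ are orthonormal eigenvectors of $A^TA$,

$$A^TAv_j = \sigma_j^2v_j, \quad j = 1, \dots, n.$$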
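$S$ is a diagonal matrix containing the square roots of the non-zero eigenvalues of both $AA^T$ and $A^TA$. A common convention is to list the singular values in descending order. In this case, the diagonal matrix $S$ is uniquely determined by $A$ (though not the matrices $U$ and $V$).

$$S = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, 0, \dots, 0)$$

assuming $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, with $r \le \min(m, n)$, where $\sigma_1, \dots, \sigma_r$ are called the singular values of the matrix $A$.
$r$ is the rank of matrix $A$, i.e., $r = \operatorname{rank}(A) = \dim(\mathcal{R}(A))$, where $\mathcal{R}(A)$ means the range of $A$, that is, the set of possible linear combinations of the columns of $A$.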

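Some Conclusions and a Simple Proof:

Let $U = [u_1, \dots, u_m]$, where $u_i \in \mathbb{R}^m$, for $i = 1, \dots, m$; and $V = [v_1, \dots, v_n]$, where $v_j \in \mathbb{R}^n$, for $j = 1, \dots, n$. Since $U^TU = I$,

$$u_i^Tu_j = \delta_{ij}$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.
Similarly, since $V^TV = I$, we have

$$v_i^Tv_j = \delta_{ij}$$

where $\delta_{ij}$ is defined as above.
That is, the columns of $U$ and $V$ are orthonormal vectors, respectively.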
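The statements of this section can be checked with a numerical SVD routine. Below is a minimal sketch using `numpy.linalg.svd`; the random 4x3 matrix is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))          # hypothetical 4x3 matrix (m=4, n=3)

# full_matrices=True returns U (4x4), the singular values s (length 3), and V^T (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the 4x3 "diagonal" matrix S and check A = U S V^T.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)
reconstructs = np.allclose(U @ S @ Vt, A)

# Columns of U and of V are orthonormal.
u_orthonormal = np.allclose(U.T @ U, np.eye(4))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(3))

# Columns of U are eigenvectors of A A^T with eigenvalues sigma_i^2
# (zero for the columns beyond the number of singular values).
eig_of_AAt = np.concatenate([s**2, np.zeros(4 - len(s))])
u_are_eigvecs = np.allclose(A @ A.T @ U, U * eig_of_AAt)

# Singular values are returned in descending order.
descending = bool(np.all(np.diff(s) <= 0))

print(reconstructs, u_orthonormal, v_orthonormal, u_are_eigvecs, descending)
```

Note that NumPy returns $V^T$ rather than $V$, and returns the singular values as a vector rather than as the matrix $S$.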

3.3 An example of SVD:

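1) Calculate $U$ by finding the eigenvalues and corresponding eigenvectors of $AA^T$; the normalized eigenvectors, ordered by decreasing eigenvalue, form the columns of $U$.
2) Calculate $V$ in the same way from the eigenvalues and corresponding eigenvectors of $A^TA$.
3) Form $S$ by placing the square roots of the non-zero eigenvalues on the diagonal in descending order.
4) The SVD result is $A = USV^T$.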
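The procedure of this example (eigen-decomposition of $AA^T$ and $A^TA$) can be sketched in NumPy. The 2x3 matrix below is a hypothetical stand-in, not necessarily the matrix of the original example. One detail is an assumption of this sketch: the right-singular vectors are computed as $v_i = A^Tu_i/\sigma_i$ rather than from a second, independent eigen-decomposition, because independent eigen-solvers may return eigenvectors with mismatched signs.

```python
import numpy as np

# Hypothetical 2x3 matrix, used only for illustration.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Step 1: eigen-decomposition of the symmetric matrix A A^T gives U.
eigvals, U = np.linalg.eigh(A @ A.T)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # reorder to descending
eigvals, U = eigvals[order], U[:, order]

# Step 2: the singular values are the square roots of the eigenvalues.
sigma = np.sqrt(eigvals)

# Step 3: right-singular vectors. v_i = A^T u_i / sigma_i keeps the signs of
# u_i and v_i consistent; each v_i is a unit eigenvector of A^T A.
V = (A.T @ U) / sigma

# Step 4: check the (thin) decomposition A = U diag(sigma) V^T.
A_rebuilt = U @ np.diag(sigma) @ V.T
print(sigma**2)
print(np.allclose(A_rebuilt, A))
```

Here $AA^T$ has eigenvalues 12 and 10, so the singular values are $\sqrt{12}$ and $\sqrt{10}$.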

3.4 Intuitive Interpretations of SVD [see Ref-5]

1) Points in d-dimension Space:

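To gain insight into the SVD, treat the rows of an $n \times d$ matrix $A$ as $n$ points in a d-dimensional space (here we use $n \times d$ instead of $m \times n$, since this is the common notation for $n$ points of dimension $d$).

For a unit vector $v$, the product $Av$ is equivalent to

$$Av = \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix} v = \begin{bmatrix} a_1^Tv \\ \vdots \\ a_n^Tv \end{bmatrix}$$

where the inner product $a_i^Tv$ means the projection of point $a_i$ (represented by the column vector $a_i$, i.e., the $i$-th row of matrix $A$) onto the line along which $v$ is a unit vector.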

2) The Best Least Squares Fit Problem:

Consider the problem of finding the best k-dimensional subspace with respect to the set of points. Here “best” means minimizing the sum of the squares of the perpendicular distances of the points to the subspace. We begin with a special case of the problem where the subspace is 1-dimensional, a line through the origin. We will see later that the best-fitting k-dimensional subspace can be found by k applications of the best fitting line algorithm. Finding the best fitting line through the origin with respect to a set of points in the plane means minimizing the sum of the squared distances of the points to the line. Here distance is measured perpendicular to the line (the corresponding problem is called the best least squares fit). More often, distance is measured vertically, in the $y$ direction (the corresponding problem is called the least squares fit).

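Returning to the best least squares fit problem, consider projecting a point $a_i$ onto a line through the origin. Then, based on the following figure,

[Figure: a point $a_i$, its projection onto a line through the origin, and its perpendicular distance to that line]

we can get

$$\|a_i\|^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2. \tag{3.9}$$

From (3.9) and the observation that $\sum_{i=1}^n \|a_i\|^2$ is a constant (i.e., independent of the line), we get the equivalence

$$\min \sum_{i=1}^n (\text{distance of } a_i \text{ to line})^2 \iff \max \sum_{i=1}^n (\text{length of projection of } a_i)^2.$$

So minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line. This conclusion helps to introduce the subsequent definition of singular vectors.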
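The equivalence between minimizing squared distances and maximizing squared projections can be illustrated numerically. The sketch below scans candidate lines through the origin in the plane; the random points and the helper `projection_and_distance` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 2))          # 50 hypothetical points in the plane (rows)

def projection_and_distance(A, v):
    """Return (sum of squared projection lengths, sum of squared distances)
    of the rows of A with respect to the line through the origin along v."""
    v = v / np.linalg.norm(v)
    proj = A @ v                           # signed projection length of each row
    residual = A - np.outer(proj, v)       # component perpendicular to the line
    return np.sum(proj**2), np.sum(residual**2)

total = np.sum(A**2)                       # sum of squared norms, independent of the line

# Candidate directions: angles in [0, pi), since v and -v describe the same line.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
results = [projection_and_distance(A, np.array([np.cos(t), np.sin(t)])) for t in angles]
proj_sq = np.array([r[0] for r in results])
dist_sq = np.array([r[1] for r in results])

best_by_projection = int(np.argmax(proj_sq))
best_by_distance = int(np.argmin(dist_sq))
print(best_by_projection, best_by_distance)
```

For every candidate line the two sums add up to the same constant `total`, so the line that maximizes one minimizes the other.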

3) Singular Vectors and Singular Values:

Singular Vectors: Consider the rows of $A$ as $n$ points in a d-dimensional space. Consider the best fit line through the origin. Let $v$ be a unit vector along this line. The length of the projection of $a_i$ (i.e., the $i$-th row of $A$) onto $v$ is $|a_i^Tv|$. From this we see that the sum of the squared lengths of the projections is $\|Av\|^2$. The best fit line is the one maximizing $\|Av\|^2$ and hence minimizing the sum of the squared distances of the points to the line.
The First Singular Vector: With this in mind, define the first singular vector $v_1$ of $A$, which is a column vector, as the best fit line through the origin for the $n$ points in d-space that are the rows of $A$. Thus $v_1 = \arg\max_{\|v\| = 1} \|Av\|$.
The First Singular Value: The value $\sigma_1(A) = \|Av_1\|$ is called the first singular value of $A$. Note that $\sigma_1^2$ is the sum of the squares of the projections of the points to the line determined by $v_1$.
The Second Singular Vector: The second singular vector $v_2$ is defined by the best fit line perpendicular to $v_1$, i.e., $v_2 = \arg\max_{v \perp v_1,\ \|v\| = 1} \|Av\|$.
The Second Singular Value: The value $\sigma_2(A) = \|Av_2\|$ is called the second singular value of $A$. Note that $\sigma_2^2$ is the sum of the squares of the projections of the points to the line determined by $v_2$.
The Third Singular Vector: The third singular vector $v_3$ is defined similarly by $v_3 = \arg\max_{v \perp v_1, v_2,\ \|v\| = 1} \|Av\|$.
The process stops when we have found $v_1, v_2, \dots, v_r$ as singular vectors and

$$\max_{v \perp v_1, \dots, v_r,\ \|v\| = 1} \|Av\| = 0$$

where $r = \operatorname{rank}(A) \le \min(n, d)$, i.e., there exist at most $\min(n, d)$ such linearly independent vectors.

4) The Frobenius norm of A:

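Consider one row, say $a_j$, of matrix $A$. Since $v_1, \dots, v_r$ span the space of all rows of $A$, $a_j^Tv = 0$ for all $v$ perpendicular to $v_1, \dots, v_r$. Thus, for each row $a_j$, $\sum_{i=1}^r (a_j^Tv_i)^2 = \|a_j\|^2$. Summing over all rows,

$$\sum_{j=1}^n \|a_j\|^2 = \sum_{j=1}^n \sum_{i=1}^r (a_j^Tv_i)^2 = \sum_{i=1}^r \sum_{j=1}^n (a_j^Tv_i)^2 = \sum_{i=1}^r \|Av_i\|^2 = \sum_{i=1}^r \sigma_i^2(A).$$

But

$$\sum_{j=1}^n \|a_j\|^2 = \sum_{j=1}^n \sum_{k=1}^d a_{jk}^2,$$

that is, the sum of squares of all the entries of $A$. Thus, the sum of squares of the singular values of $A$ is indeed the square of the “whole content of $A$”, i.e., the sum of squares of all the entries. There is an important norm associated with this quantity, the Frobenius norm of $A$, denoted by $\|A\|_F$, defined as

$$\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}. \tag{3.17}$$

This is summarized in the following lemma: for any matrix $A$, the sum of squares of the singular values equals the square of the Frobenius norm, i.e., $\sum_i \sigma_i^2(A) = \|A\|_F^2$.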
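The relation between the Frobenius norm and the singular values is easy to confirm numerically. A minimal sketch, assuming a random 5x3 matrix for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))            # hypothetical 5x3 matrix

# Singular values only (no U or V needed here).
s = np.linalg.svd(A, compute_uv=False)

frob_from_entries = np.sqrt(np.sum(A**2))            # square root of the sum of squared entries
frob_from_singular_values = np.sqrt(np.sum(s**2))    # square root of the sum of squared singular values

print(frob_from_entries, frob_from_singular_values)
```

Both numbers agree with `np.linalg.norm(A, 'fro')` up to rounding error.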

3.5 Intuitive Interpretations of SVD [see Ref-6]

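[Figure: four panels illustrating the SVD $M = U\Sigma V^*$ of a real $2 \times 2$ matrix $M$ acting on the unit disc; see Ref-6]

1) The image shows:

Upper Left: The unit disc with the two canonical unit vectors.
Upper Right: Unit disc transformed with $M$, with the singular values $\sigma_1$ and $\sigma_2$ indicated.
Lower Left: The action of $V^*$ on the unit disc. This is just a rotation. Here $V^*$ means the conjugate transpose of $V$.
Lower Right: The action of $\Sigma V^*$ on the unit disc. $\Sigma$ scales vertically and horizontally.
In this special case, the singular values are $\varphi$ and $1/\varphi$, where $\varphi$ is the golden ratio, i.e., $\varphi = \frac{1 + \sqrt{5}}{2}$.

$V^*$ is a (counter clockwise) rotation by an angle $\alpha$, where $\alpha$ satisfies $\tan(\alpha) = -\varphi$. $U$ is a rotation by an angle $\beta$ with $\tan(\beta) = \varphi - 1$.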

2) Singular values as semiaxes of an ellipse or ellipsoid:

As shown in the figure, the singular values can be interpreted as the semiaxes of an ellipse in 2D. This concept can be generalized to n-dimensional Euclidean space, with the singular values of any square matrix being viewed as the semiaxes of an n-dimensional ellipsoid. See below for further details.

3) The columns of U and V are orthonormal bases:

Since $U$ and $V^*$ are unitary, the columns of each of them form a set of orthonormal vectors, which can be regarded as basis vectors. The matrix $M$ maps the basis vector $V_i$ to the stretched unit vector $\sigma_iU_i$. By the definition of a unitary matrix, the same is true for their conjugate transposes $U^*$ and $V$, except the geometric interpretation of the singular values as stretches is lost. In short, the columns of $U$, $U^*$, $V$, and $V^*$ are orthonormal bases.

4. Expansion of eigenvalues and eigenvectors [see Ref-8]

Problem - PRML Exercise 2.19:

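Show that a real, symmetric matrix $\Sigma$ satisfying the eigenvector equation $\Sigma u_i = \lambda_iu_i$ can be expressed as an expansion of its eigenvalues and eigenvectors of the following form

$$\Sigma = \sum_{i=1}^D \lambda_iu_iu_i^T \tag{4.1}$$

and similarly, the inverse $\Sigma^{-1}$ can be expressed as

$$\Sigma^{-1} = \sum_{i=1}^D \frac{1}{\lambda_i}u_iu_i^T \tag{4.2}$$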

Solution:

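1) Lemma 4-1: A real symmetric matrix is orthogonally similar to a diagonal matrix. That is, if $\Sigma$ is a real symmetric square matrix, then there exists an orthogonal matrix $U$ such that

$$U^T\Sigma U = \Lambda \tag{4.3}$$

or, due to $U^TU = UU^T = I$, equivalent equations include

$$\Sigma U = U\Lambda \tag{4.4}$$

and

$$\Sigma = U\Lambda U^T \tag{4.5}$$

Here the columns of $U$ are the orthonormal eigenvectors $u_i$ and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_D)$.

2) Lemma 4-2: Matrices $A$ and $B$ are identical if and only if for all vectors $x$, $Ax = Bx$. That is,

$$A = B \iff Ax = Bx \ \text{ for all } x \tag{4.6}$$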

3) Proof:

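The proof of (4.1) and (4.2) uses (4.5) and (4.6). For any column vector $x$,
we have

$$\Sigma x = U\Lambda U^Tx = \sum_{i=1}^D (u_i^Tx)\,\lambda_i\,u_i \tag{4.7}$$

Since the inner product $u_i^Tx$ in (4.7) is a scalar, and $\lambda_i$ is also a scalar, we can change the order of the terms,

$$\Sigma x = \sum_{i=1}^D \lambda_iu_i(u_i^Tx) = \Big(\sum_{i=1}^D \lambda_iu_iu_i^T\Big)x \tag{4.8}$$

Thus, applying Lemma 4-2 shown in (4.6) to (4.8), we can prove (4.1).

Since $U^T\Sigma U = \Lambda$, inverting both sides gives $U^{-1}\Sigma^{-1}(U^T)^{-1} = \Lambda^{-1}$, and hence $\Sigma^{-1} = U\Lambda^{-1}U^T$. Applying the above result to $\Sigma^{-1}$, noting that $\Lambda^{-1}$ is just the diagonal matrix of the inverses of the diagonal elements of $\Lambda$, we have proved (4.2).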
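The two expansions of Exercise 2.19 can be verified numerically. The sketch below assumes a randomly generated symmetric positive definite matrix (so that the inverse exists); the way it is constructed is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4.0 * np.eye(4)          # hypothetical symmetric positive definite matrix

# eigh is intended for symmetric matrices; columns of U are orthonormal eigenvectors.
lam, U = np.linalg.eigh(Sigma)

# Expansion of Sigma: sum_i lambda_i u_i u_i^T.
Sigma_expanded = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))

# Expansion of the inverse: sum_i (1 / lambda_i) u_i u_i^T.
Sigma_inv_expanded = sum(np.outer(u, u) / l for l, u in zip(lam, U.T))

print(np.allclose(Sigma_expanded, Sigma))
print(np.allclose(Sigma_inv_expanded, np.linalg.inv(Sigma)))
```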

5. Best Rank k Approximation using SVD [see Ref-5]

Let $A$ be an $n \times d$ matrix and think of the rows of $A$ as $n$ points in d-dimensional space. There are two important matrix norms, the Frobenius norm denoted $\|A\|_F$ and the 2-norm denoted $\|A\|_2$.

The 2-norm of the matrix $A$ is given by

$$\|A\|_2 = \max_{\|v\| = 1} \|Av\| \tag{5.1}$$

and thus equals the largest singular value of the matrix. That is, the 2-norm is the square root of the sum of squared distances to the origin along the direction that maximizes this quantity.
The Frobenius norm of $A$ is the square root of the sum of the squared distances of the points to the origin, shown in (3.17).

Let $A$ be an $n \times d$ matrix and let $A = \sum_{i=1}^r \sigma_iu_iv_i^T$ be the SVD of $A$. For $k \in \{1, 2, \dots, r\}$, let

$$A_k = \sum_{i=1}^k \sigma_iu_iv_i^T \tag{5.2}$$

be the sum truncated after $k$ terms. It is clear that $A_k$ has rank $k$. Furthermore, $A_k$ is the best rank $k$ approximation to $A$ when the error is measured in either the 2-norm or the Frobenius norm (see Theorem 5.2 and Theorem 5.3).
Without proof, we give the following theorems (if interested, please check Lemma 1.6, Theorem 1.7, Theorem 1.8, and Theorem 1.9 in page 9-10 of Ref-5).

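Theorem 5.1:

The rows of matrix $A_k$ are the projections of the rows of $A$ onto the subspace $V_k$ spanned by the first $k$ singular vectors of $A$.

Theorem 5.2:

Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$, it holds that $\|A - A_k\|_F \le \|A - B\|_F$.

Theorem 5.3:

Let $A$ be an $n \times d$ matrix. For any matrix $B$ of rank at most $k$, it holds that $\|A - A_k\|_2 \le \|A - B\|_2$.

Theorem 5.4:

Let $A$ be an $n \times d$ matrix. For $A_k$ in (5.2) it holds that $\|A - A_k\|_2^2 = \sigma_{k+1}^2$.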
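The best rank-$k$ approximation can be illustrated with a truncated SVD. The sketch below assumes a random 6x5 matrix and $k = 2$, and compares the truncated sum $A_k$ against one arbitrary competitor `B` of rank at most $k$; a single competitor only illustrates the theorems and does not prove them.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 5))            # hypothetical 6x5 matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated sum: keep only the first k singular triplets.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_2 = np.linalg.norm(A - A_k, 2)         # 2-norm (largest singular value) of the error
err_F = np.linalg.norm(A - A_k, 'fro')     # Frobenius norm of the error

# An arbitrary competitor of rank at most k, built as a product of thin factors.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
competitor_err_2 = np.linalg.norm(A - B, 2)
competitor_err_F = np.linalg.norm(A - B, 'fro')

print(np.linalg.matrix_rank(A_k), err_2, s[k])
print(err_F, np.sqrt(np.sum(s[k:]**2)))
```

The 2-norm error equals the first discarded singular value `s[k]`, and the Frobenius error equals the root of the sum of the squared discarded singular values.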

6. The Geometry of Linear Transformations [see Ref-3]

6.1 Matrix and Transformation

Let us begin by looking at some simple matrices, namely those with two rows and two columns. Our first example is the diagonal matrix

$$M = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}$$

Geometrically, we may think of a matrix like this as taking a point $(x, y)$ in the plane and transforming it into another point using matrix multiplication:

$$\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3x \\ y \end{bmatrix}$$

The effect of this transformation is shown below: the plane is horizontally stretched by a factor of 3, while there is no vertical change.

[Figure: a square grid and its image under the diagonal matrix]

Now let’s look at

$$M = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$

The four vertices of the red square shown in the following figure are transformed by $M$, which produces this effect

[Figure: a red unit square and its image under the symmetric matrix]

It is not so clear how to describe simply the geometric effect of the transformation. However, let’s rotate our grid through a $45^\circ$ angle and see what happens. The four vertices of the rotated red square are transformed in the same way, which produces this effect

[Figure: the rotated grid and its image under the symmetric matrix]

We see now that this new grid is transformed in the same way that the original grid was transformed by the diagonal matrix: the grid is stretched by a factor of 3 in one direction.

This is a very special situation due to the fact that the matrix $M$ is symmetric, i.e., $M^T = M$. If we have a symmetric $2 \times 2$ matrix, it turns out that we may always rotate the grid in the domain so that the matrix acts by stretching and perhaps reflecting in the two directions. In other words, symmetric matrices behave like diagonal matrices.

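Conclusion:

The figures above address the following question: given a symmetric matrix $M$, i.e., $M^T = M$,

how should we place the coordinate grid (in other words, how should we choose the position and orientation of a unit square, which can be represented by two mutually perpendicular unit vectors $v_1$ and $v_2$) so that, when the transformation represented by the symmetric matrix $M$ is applied to the square, it is deformed purely by stretching or compressing along the directions of $v_1$ and $v_2$? This connects with the eigenvectors and eigenvalues of a matrix discussed below. That is,

$$Mv_i = \lambda_iv_i$$

says that after the eigenvector $v_i$ is transformed by the matrix $M$, the new vector is parallel to the original one (in the same or the opposite direction); only its length changes.

How do we find such $v_1$ and $v_2$? When $M$ is symmetric (a special case; more general matrices are discussed next), $v_1$ and $v_2$ are exactly the two eigenvectors of $M$. For the example $M = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, solving $Mv_i = \lambda_iv_i$ gives the eigenvalues and eigenvectors

$$\lambda_1 = 3,\ v_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}; \qquad \lambda_2 = 1,\ v_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

which accords with the rotation of the red square shown above.

For this special symmetric matrix, the SVD reduces to Lemma 4-1 (a real symmetric matrix is orthogonally similar to a diagonal matrix), as shown in (4.5). It can be regarded as a special case of the SVD, i.e., for the matrix $M$ we have the following SVD:

$$M = V\Lambda V^T \tag{6.1}$$

For a general matrix $M$, there exist orthogonal matrices $U$ and $V$ (i.e., $U^TU = I$ and $V^TV = I$), such that

$$M = U\Sigma V^T \tag{6.2}$$

In (6.2), $U$ is formed from the eigenvectors of $MM^T$, $V$ is formed from the eigenvectors of $M^TM$, and the diagonal matrix $\Sigma$ is formed from the positive square roots of the eigenvalues of $M^TM$ (or $MM^T$).

When $M$ is a real symmetric matrix, there exists an orthogonal matrix $V$ (i.e., $V^TV = I$), such that

$$M = V\Lambda V^T \tag{6.3}$$

In (6.3), $V$ is formed from the eigenvectors of the symmetric matrix $M$, and the diagonal matrix $\Lambda$ is formed from the eigenvalues of $M$. It can also be obtained by the method described above, i.e., $V$ is formed from the eigenvectors of $M^TM$, and the diagonal matrix is formed from the positive square roots of the eigenvalues of $M^TM$. The two methods agree when the eigenvalues of $M$ are non-negative; for a negative eigenvalue, the singular value is its absolute value and the sign is absorbed into $U$.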

6.2 The Geometry of Eigenvectors and Eigenvalues

Said with more mathematical precision, given a symmetric matrix $M$, we may find a set of orthogonal vectors $v_i$ so that $Mv_i$ is a scalar multiple of $v_i$; that is

$$Mv_i = \lambda_iv_i$$

where $\lambda_i$ is a scalar.

Geometrically, this means that the vectors $v_i$ are simply stretched and/or reflected (i.e., their direction is reversed by 180°) when multiplied by $M$. Because of this property, we call

Eigenvectors: the vectors $v_i$ are called eigenvectors of $M$;
Eigenvalues: the scalars $\lambda_i$ are called eigenvalues.

An important fact, which is easily verified, is that eigenvectors of a symmetric matrix corresponding to different eigenvalues are orthogonal. If we use the eigenvectors of a symmetric matrix to align the grid, the matrix stretches and/or reflects the grid in the same way that it does the eigenvectors.

The geometric description we gave for this linear transformation is a simple one: the grid is simply stretched in one direction. For more general matrices, we will ask if we can find an orthogonal grid that is transformed into another orthogonal grid. Let’s consider a final example using a matrix that is not symmetric:

$$M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$$

This matrix produces the geometric effect known as a shear, shown as

[Figure: a square grid and its image under the shear]

It’s easy to find one family of eigenvectors along the horizontal axis. However, our figure above shows that these eigenvectors cannot be used to create an orthogonal grid that is transformed into another orthogonal grid.

Nonetheless, let’s see what happens when we rotate the grid first by $30^\circ$.
Notice that the angle at the origin formed by the red parallelogram on the right has increased.
Let’s next rotate the grid by $60^\circ$.
It appears that the grid on the right is now almost orthogonal.
In fact, by rotating the grid in the domain by an angle of roughly $58.28^\circ$, both grids are now orthogonal.

[Figure: the rotated grid in the domain and its orthogonal image under the shear]

How to calculate this angle of roughly $58.28^\circ$?

Solution:

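Based on the discussion of (6.2), the columns of $V$ are the eigenvectors of $M^TM$. With the shear matrix $M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$,

$$M^TM = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$$

We can get the eigenvalues $\lambda_1 = \frac{3 + \sqrt{5}}{2} \approx 2.618$ and $\lambda_2 = \frac{3 - \sqrt{5}}{2} \approx 0.382$; the corresponding unit eigenvectors are

$$v_1 \approx \begin{bmatrix} 0.5257 \\ 0.8507 \end{bmatrix}, \qquad v_2 \approx \begin{bmatrix} -0.8507 \\ 0.5257 \end{bmatrix}$$

where the directions of $v_1$ and $v_2$ are $\theta_1 = \arctan(0.8507 / 0.5257) \approx 58.28^\circ$ and $\theta_2 = \theta_1 + 90^\circ \approx 148.28^\circ$ (these values can be checked numerically, e.g., in Matlab).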
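The angle can be reproduced with a few lines of NumPy. This sketch assumes the shear matrix $M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$, which is consistent with the golden-ratio singular values mentioned in section 6.3; the sign normalization of the eigenvector is a convention chosen here so that the angle falls in the first quadrant.

```python
import numpy as np

# Assumed shear matrix for this example.
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Eigen-decomposition of the symmetric matrix M^T M (eigenvalues in ascending order).
eigvals, V = np.linalg.eigh(M.T @ M)

# Eigenvector belonging to the largest eigenvalue: the direction of v_1.
v1 = V[:, np.argmax(eigvals)]
v1 = v1 * np.sign(v1[0])                   # fix the arbitrary sign: first component positive

angle_deg = np.degrees(np.arctan2(v1[1], v1[0]))
print(eigvals, v1, angle_deg)
```

The angle equals $\arctan(\varphi)$, where $\varphi = (1 + \sqrt{5})/2$ is the golden ratio, which is roughly $58.28^\circ$.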

6.3 The singular value decomposition

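This is the geometric essence of the singular value decomposition for $2 \times 2$ matrices:

for any $2 \times 2$ matrix, we may find an orthogonal grid that is transformed into another orthogonal grid. We will express this fact using vectors:

with an appropriate choice of orthogonal unit vectors $v_1$ and $v_2$, the vectors $Mv_1$ and $Mv_2$ are orthogonal.

[Figure: orthogonal unit vectors $v_1$, $v_2$ and their orthogonal images $Mv_1$, $Mv_2$]

We will use $u_1$ and $u_2$ to denote unit vectors in the direction of $Mv_1$ and $Mv_2$. The lengths of $Mv_1$ and $Mv_2$, denoted by $\sigma_1$ and $\sigma_2$, describe the amount that the grid is stretched in those particular directions. These numbers are called the singular values of $M$. (In this case, the singular values are the golden ratio and its reciprocal, but that is not so important here.)

[Figure: the unit vectors $u_1$, $u_2$ and the stretch factors $\sigma_1$, $\sigma_2$]

We therefore have

$$Mv_1 = \sigma_1u_1, \qquad Mv_2 = \sigma_2u_2$$

We may now give a simple description for how the matrix $M$ treats a general vector $x$. Since the vectors $v_1$ and $v_2$ are orthogonal unit vectors, we have

$$x = (v_1 \cdot x)\,v_1 + (v_2 \cdot x)\,v_2$$

This means that

$$Mx = (v_1 \cdot x)\,Mv_1 + (v_2 \cdot x)\,Mv_2 = (v_1 \cdot x)\,\sigma_1u_1 + (v_2 \cdot x)\,\sigma_2u_2$$

Remember that the dot product may be computed using the vector transpose

$$v \cdot x = v^Tx$$

which leads to

$$Mx = u_1\sigma_1v_1^Tx + u_2\sigma_2v_2^Tx, \qquad M = u_1\sigma_1v_1^T + u_2\sigma_2v_2^T$$

This is usually expressed by writing

$$M = U\Sigma V^T$$

where $U$ is a matrix whose columns are the vectors $u_1$ and $u_2$, $\Sigma$ is a diagonal matrix whose entries are $\sigma_1$ and $\sigma_2$, and $V$ is a matrix whose columns are $v_1$ and $v_2$.

This shows how to decompose the matrix $M$ into the product of three matrices:

$V$ describes an orthonormal basis in the domain, and
$U$ describes an orthonormal basis in the co-domain, and
$\Sigma$ describes how much the vectors in $V$ are stretched to give the vectors in $U$.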
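The decomposition into rank-one terms can be checked numerically. The sketch below assumes the shear matrix $M = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ (an assumption consistent with the golden-ratio remark above) and an arbitrary test vector `x`.

```python
import numpy as np

# Assumed shear matrix for this example.
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(M)                # rows of Vt are v_1^T and v_2^T

# M as a sum of rank-one terms u_i * sigma_i * v_i^T.
rank_one_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(2))

# Action on a general vector: M x = sum_i (v_i . x) sigma_i u_i.
x = np.array([0.3, -1.2])                  # arbitrary test vector
Mx_from_svd = sum((Vt[i, :] @ x) * s[i] * U[:, i] for i in range(2))

golden = (1 + np.sqrt(5)) / 2
print(s, golden, 1 / golden)
print(np.allclose(rank_one_sum, M), np.allclose(Mx_from_svd, M @ x))
```

The printed singular values match the golden ratio and its reciprocal.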

6.4 How do we find the singular value decomposition?

The power of the singular value decomposition lies in the fact that we may find it for any matrix. How do we do it? Let’s look at our earlier example and add the unit circle in the domain. Its image will be an ellipse whose major and minor axes define the orthogonal grid in the co-domain.

[Figure: the unit circle in the domain and its elliptical image in the co-domain]

Notice that the major and minor axes are defined by $Mv_1$ and $Mv_2$. These vectors therefore are the longest and shortest vectors among all the images of vectors on the unit circle.

[Figure: the vectors $Mv_1$ and $Mv_2$ along the axes of the ellipse]

In other words, the function $\|Mx\|$ on the unit circle has a maximum at $v_1$ and a minimum at $v_2$. This reduces the problem to a rather standard calculus problem in which we wish to optimize a function over the unit circle. It turns out that the critical points of this function occur at the eigenvectors of the matrix $M^TM$. Since this matrix is symmetric (since it is obvious that $(M^TM)^T = M^TM$), eigenvectors corresponding to different eigenvalues will be orthogonal. This gives the family of vectors $v_i$.

The singular values are then given by $\sigma_i = \|Mv_i\|$, and the vectors $u_i$ are obtained as unit vectors in the direction of $Mv_i$.

But why are the vectors $u_i$ orthogonal? To explain this, we will assume that $\sigma_i$ and $\sigma_j$ are distinct singular values. We have

$$Mv_i = \sigma_iu_i, \qquad Mv_j = \sigma_ju_j$$

Let’s begin by looking at the expression $Mv_i \cdot Mv_j$ and assuming, for convenience, that the singular values are non-zero.

On one hand, this expression is zero because the vectors $v_i$ and $v_j$, which are eigenvectors of the symmetric matrix $M^TM$, are orthogonal to one another, i.e.,

$$M^TMv_j = \lambda_jv_j, \qquad v_i \cdot v_j = 0$$

Therefore,

$$Mv_i \cdot Mv_j = v_i^TM^TMv_j = \lambda_j\,v_i \cdot v_j = 0$$

On the other hand, we have

$$Mv_i \cdot Mv_j = \sigma_i\sigma_j\,u_i \cdot u_j = 0$$

Therefore, $u_i$ and $u_j$ are orthogonal, so we have found an orthogonal set of vectors $v_i$ that is transformed into another orthogonal set $u_i$. The singular values describe the amount of stretching in the different directions.

In practice, this is not the procedure used to find the singular value decomposition of a matrix since it is not particularly efficient or well-behaved numerically.

6.5 Another example

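Let’s now look at the singular matrix

$$M = \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix}$$

Here $M^TM = \begin{bmatrix} 5 & 5 \\ 5 & 5 \end{bmatrix}$. We can get the eigenvalues $\lambda_1 = 10$ and $\lambda_2 = 0$; the corresponding unit eigenvectors are

$$v_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad v_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

where the directions of $v_1$ and $v_2$ are $45^\circ$ and $135^\circ$ (these values can be checked numerically, e.g., in Matlab).

The geometric effect of this matrix is the following:

[Figure: a square grid collapsed onto a line by the singular matrix]

In this case, the second singular value is zero so that we may write:

$$M = u_1\sigma_1v_1^T$$

with $\sigma_1 = \sqrt{10}$ and $u_1 = \frac{1}{\sqrt{5}}\begin{bmatrix} 1 \\ 2 \end{bmatrix}$. In other words, if some of the singular values are zero, the corresponding terms do not appear in the decomposition for $M$. In this way, we see that the rank of $M$, which is the dimension of the image of the linear transformation, is equal to the number of non-zero singular values.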

6.6 SVD Application 1 – Data compression

Singular value decompositions can be used to represent data efficiently. Suppose, for instance, that we wish to transmit the following image, which consists of an array of black or white pixels.

[Figure: a black-and-white pixel image]

Since there are only three types of columns in this image, as shown below, it should be possible to represent the data in a more compact form.

[Figure: the three types of columns that occur in the image]

We will represent the image as an $m \times n$ matrix $M$ in which each entry is either a 0, representing a black pixel, or 1, representing white. As such, there are $mn$ entries in the matrix. If we perform a singular value decomposition on $M$, we find there are only three non-zero singular values $\sigma_1$, $\sigma_2$, and $\sigma_3$.
Therefore, the matrix may be represented as

$$M = u_1\sigma_1v_1^T + u_2\sigma_2v_2^T + u_3\sigma_3v_3^T$$

This means that we have three vectors $v_i$, each of which has $n$ entries, three vectors $u_i$, each of which has $m$ entries, and three singular values $\sigma_i$. This implies that we may represent the matrix using only $3(m + n + 1)$ numbers rather than the $mn$ that appear in the matrix. In this way, the singular value decomposition discovers the redundancy in the matrix and provides a format for eliminating it.

Why are there only three non-zero singular values? Remember that the number of non-zero singular values equals the rank of the matrix. In this case, we see that there are three linearly independent columns in the matrix, which means that $\operatorname{rank}(M) = 3$.
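The compression argument can be made concrete with a small experiment. The image below is a hypothetical 25x15 black-and-white pattern built from three column types; its size, column patterns, and layout are assumptions for illustration and are not taken from Ref-3.

```python
import numpy as np

# Hypothetical 25x15 image (1 = white, 0 = black) built from three column types.
m, n = 25, 15
col_a = np.ones(m)                         # all white
col_b = np.ones(m)
col_b[2:23] = 0.0                          # one long black bar
col_c = np.ones(m)
col_c[2:6] = 0.0                           # two short black bars
col_c[19:23] = 0.0
columns = {"a": col_a, "b": col_b, "c": col_c}
M = np.column_stack([columns[ch] for ch in "aabbbcccccbbbaa"])

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Count singular values that are non-zero up to rounding error.
r = int(np.sum(s > 1e-10 * s[0]))

# Rebuild the image from the first r singular triplets only.
M_rebuilt = (U[:, :r] * s[:r]) @ Vt[:r, :]

stored_numbers = r * (m + n + 1)           # r vectors u_i, r vectors v_i, r singular values
print(r, stored_numbers, m * n)
print(np.allclose(M_rebuilt, M))
```

With three column types the rank is 3, so 3 x (25 + 15 + 1) = 123 numbers reproduce all 375 entries of this hypothetical image.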

6.7 SVD Application 2 – Noise reduction

The previous example showed how we can exploit a situation where many singular values are zero. Typically speaking, the large singular values point to where the interesting information is. For example, imagine we have used a scanner to enter this image into our computer. However, our scanner introduces some imperfections (usually called “noise”) in the image.

[Figure: the scanned image with noise]

We may proceed in the same way: represent the data using a matrix $M$ and perform a singular value decomposition. The singular values $\sigma_1$, $\sigma_2$, and $\sigma_3$ turn out to be much larger than the remaining ones.

Clearly, the first three singular values are the most important, so we will assume that the others are due to the noise in the image and make the approximation

$$M \approx u_1\sigma_1v_1^T + u_2\sigma_2v_2^T + u_3\sigma_3v_3^T$$

This leads to the following improved image.

[Figure: the image after noise reduction]
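Noise reduction by truncating the SVD can be sketched as follows. Everything specific here is an assumption for illustration: the clean 25x15 image, the Gaussian noise model, and the noise level of 0.05 are hypothetical and do not reproduce the scanned image of Ref-3.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical clean 25x15 image (1 = white, 0 = black) with three column types.
m, n = 25, 15
col_a = np.ones(m)
col_b = np.ones(m)
col_b[2:23] = 0.0
col_c = np.ones(m)
col_c[2:6] = 0.0
col_c[19:23] = 0.0
columns = {"a": col_a, "b": col_b, "c": col_c}
clean = np.column_stack([columns[ch] for ch in "aabbbcccccbbbaa"])

# Simulated scanner noise (assumed Gaussian with small amplitude).
noisy = clean + 0.05 * rng.standard_normal(clean.shape)

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

# Keep only the three dominant singular triplets.
k = 3
denoised = (U[:, :k] * s[:k]) @ Vt[:k, :]

err_noisy = np.linalg.norm(noisy - clean)        # Frobenius error before truncation
err_denoised = np.linalg.norm(denoised - clean)  # Frobenius error after truncation
print(np.round(s[:5], 2))
print(err_noisy, err_denoised)
```

In this setup the first three singular values stand well apart from the rest, and the truncated matrix is closer to the clean image than the noisy one.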

6.8 SVD Application 3 – Data analysis

Noise also arises anytime we collect data: no matter how good the instruments are, measurements will always have some error in them. If we remember the theme that large singular values point to important features in a matrix, it seems natural to use a singular value decomposition to study data once it is collected. As an example, suppose that we collect some data as shown below:

[Figure: scatter plot of the collected data points]

We may take the data and put it into a matrix with two rows, one for each coordinate of the data points, and perform a singular value decomposition. We find two singular values $\sigma_1$ and $\sigma_2$, with $\sigma_1$ much larger than $\sigma_2$.

With one singular value so much larger than the other, it may be safe to assume that the small value of $\sigma_2$ is due to noise in the data and that this singular value would ideally be zero. In that case, the matrix would have rank one, meaning that all the data lies on the line defined by $u_1$.

[Figure: the data points together with the line defined by $u_1$]

This brief example points to the beginnings of a field known as principal component analysis (PCA), a set of techniques that uses singular values to detect dependencies and redundancies in data.

In a similar way, singular value decompositions can be used to detect groupings in data, which explains why singular value decompositions are being used in attempts to improve Netflix’s movie recommendation system. Ratings of movies you have watched allow a program to sort you into a group of others whose ratings are similar to yours. Recommendations may be made by choosing movies that others in your group have rated highly.

8. Reference

[1]: Kirk Baker, Singular Value Decomposition Tutorial, https://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf.
[2]: Singular Value Decomposition (SVD) tutorial, http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm.
[3]: We Recommend a Singular Value Decomposition, http://www.ams.org/samplings/feature-column/fcarc-svd.
[4]: Computation of the Singular Value Decomposition, http://www.cs.utexas.edu/users/inderjit/public_papers/HLA_SVD.pdf.
[5]: CMU, SVD Tutorial, https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/book-chapter-4.pdf.
[6]: Wiki: Singular value decomposition, https://en.wikipedia.org/wiki/Singular_value_decomposition.
[7]: Chapter 6 Eigenvalues and Eigenvectors, http://math.mit.edu/~gs/linearalgebra/ila0601.pdf.
[8]: Expressing a matrix as an expansion of its eigenvalues, http://math.stackexchange.com/questions/331826/expressing-a-matrix-as-an-expansion-of-its-eigenvalues.
[9]: Wiki: Eigenvalues and eigenvectors, https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors.

Posted 2016-07-06 01:31 by GloryOfFamily
