Study notes for Principal Component Analysis

Motivations for Dimensionality Reduction

  • Data Compression
    • Speed up algorithms. With a very large number of features (e.g. 10,000), a learning algorithm may be too slow to be useful. With PCA, we can reduce the dimensionality and make it tractable. Typically, you can reduce data dimensionality by 5-10x without a major hit to algorithm performance.
    • Reduce disk/memory space used by data
    • Reduce highly correlated (redundant) features; hence better represent original data (by transforming from x to z space). 
    • Note that the number of training examples is not reduced; what is reduced is the dimensionality of the feature vector for each training example (i.e., the number of features). 
    • In practice, we'd normally try and do 1000D --> 100D. 
  • Visualization
    • Represent data in 2D/3D space. It is hard to visualize high-dimensional data. 
    • Visualization helps us understand and interpret our data because we focus on the two or three main dimensions of variations. 

Principal Component Analysis (PCA)

  • It is the most commonly used technique for the dimensionality reduction problem.
  • You should normally do mean normalization and feature scaling on your data before PCA (see the sketch after this list).
  • PCA aims to find a lower-dimensional space such that the sum of squared projection errors is minimized.
  • Formally, to reduce from n dimensions to k dimensions, we find k vectors u = (u(1), u(2), ..., u(k)) onto which to project the data so as to minimize the projection error. Those vectors represent a new space.
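As a minimal sketch of that preprocessing step, assuming the data is stored in an m x n matrix X with one training example per row (the variable names here are illustrative, not from any particular library):

    [m, n] = size(X);
    mu = mean(X);                                    % 1 x n row vector of feature means
    sd = std(X);                                     % 1 x n row vector of standard deviations
    X_norm = (X - repmat(mu, m, 1)) ./ repmat(sd, m, 1);   % zero mean, unit spread per feature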

The PCA Algorithm

  1. Compute the covariance matrix: Sigma = (1/m) * sum_{i=1..m} x(i) * x(i)'
    This is commonly denoted by the Greek uppercase sigma - NOT the summation symbol. Sigma is an n x n matrix and each example x(i) is an n x 1 vector. In Matlab/Octave, assuming X is the m x n data matrix with one example per row, it is implemented by: Sigma = (1/m) * X' * X;
    Q: Why calculate the (expensive) covariance matrix?
    • For covariance matrix, the exact value of each entry is not as important as its sign. 
      • A positive value indicates that both dimensions increase or decrease together. E.g. as the number of hours studied increases, the grades in that subject also increase.
      • A negative value indicates while one increases the other decreases, or vice-versa. E.g. active social life vs. performance in computer science. 
      • If the covariance is zero, the two dimensions are uncorrelated (no linear relationship). E.g. heights of students vs. grades obtained in a subject.
    • Covariance calculations are used to find relationships between dimensions in high dimensional data sets where visualization is difficult.
  2. Compute the eigenvectors of the covariance matrix: [U, S, V] = svd(Sigma), where U is also an n x n matrix, and it turns out that the columns of U are the u vectors we want. In other words, we can take the first k columns of U to form u, i.e., u = U(:, 1:k).
  3. Transform x to the z space: z = u' * x, where z is a k x 1 vector. To recover from z back to x: x_approx = u * z. Note that we lose some information, i.e., not every x can be perfectly recovered from the z space. 
  4. How to determine the value of k (= the number of principal components)? Guideline: retain 99% of the variance of the original data. In other words, the ratio of the average squared projection error to the total variation in the data should be at most 0.01:
    [ (1/m) * sum_{i=1..m} ||x(i) - x_approx(i)||^2 ] / [ (1/m) * sum_{i=1..m} ||x(i)||^2 ] <= 0.01
    For implementation, we adopt the equation below instead:
    [ sum_{i=1..k} S(i,i) ] / [ sum_{i=1..n} S(i,i) ] >= 0.99
    where S(i,i) are the diagonal elements of the matrix S. Choose the smallest k for which the above inequality holds. 
    Q: Why choose the first k dimensions, the ones with the greatest variance?
    A: The underlying assumption is that large variances correspond to important dynamics. Hence, principal components with larger associated variances represent interesting structure, while those with lower variances represent noise. In other words, after the feature transformation (from x to z), the most important features in the new space are the first k vectors. 
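The whole algorithm fits in a few lines of Matlab/Octave. The sketch below is illustrative only: it assumes X is an m x n data matrix with one (already mean-normalized) training example per row, and the variable names (Sigma, U_reduce, Z, X_approx) are my own, not from any particular library.

    [m, n] = size(X);

    % Step 1: covariance matrix (n x n)
    Sigma = (1 / m) * (X' * X);

    % Step 2: eigenvectors via the singular value decomposition
    [U, S, V] = svd(Sigma);

    % Step 4: smallest k that retains at least 99% of the variance
    s = diag(S);
    k = find(cumsum(s) / sum(s) >= 0.99, 1);

    % Step 3: project onto the first k principal components (Z is m x k)
    U_reduce = U(:, 1:k);
    Z = X * U_reduce;

    % Lossy reconstruction back in the original n-dimensional space
    X_approx = Z * U_reduce';

For a single example x (an n x 1 vector), the corresponding projection is z = U_reduce' * x, matching step 3 above.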

Related to linear regression?

  • PCA is NOT linear regression. Despite cosmetic similarities, they are very different.
  • Linear regression is a supervised learning algorithm which aims to find a straight line that minimizes the sum of squared vertical distances between each point's y value and the line's prediction. The objective is to find a fitted line that predicts y values accurately. It works on the training set (x(i), y(i)), where i = 1, ..., m. 
  • PCA is an unsupervised learning algorithm which aims to minimize the projection error, i.e., the magnitude of the shortest orthogonal distance from each point to the lower-dimensional subspace. The objective is to reduce the dimensionality of the feature space while retaining as much of the original variance as possible. It works on the feature vectors x(i) = (x1, ..., xn) alone, with no labels y.
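To make the contrast concrete, here is a small Matlab/Octave sketch on a hypothetical 2D data set; the toy data and variable names are illustrative only:

    x = randn(50, 1);
    y = 2 * x + 0.3 * randn(50, 1);

    % Linear regression: minimize the vertical (prediction) error in y
    A = [ones(50, 1), x];
    theta = A \ y;                                  % least-squares fit
    vertical_err = sum((y - A * theta) .^ 2);

    % PCA: minimize the orthogonal projection error onto the first component
    X = [x, y];
    X = X - repmat(mean(X), 50, 1);                 % mean-normalize
    [U, S, V] = svd((1 / 50) * (X' * X));
    X_approx = (X * U(:, 1)) * U(:, 1)';            % project onto u(1) and reconstruct
    projection_err = sum(sum((X - X_approx) .^ 2));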

Advice for Applying PCA

  • DO NOT use PCA to prevent over-fitting. Although reducing the number of features makes over-fitting somewhat less likely, PCA is not a good way to address it; it is always better to use regularization instead. One important reason is that PCA throws away some of the information without knowing which values it is losing (it ignores the labels y).
  • Always try your learning algorithm on the original data first. ONLY if you find that it takes too long to train (or uses too much memory/disk) should you use PCA to reduce the dimensionality and speed up the algorithm. 
  • PCA is easy enough to add on as a pre-processing step. 

 
