Python NumPy Notes 2: PCA

Notes on the procedure for applying the PCA algorithm.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html#numpy.cov

Compute the covariance of the data $X = [x_1, \ldots, x_N]^T$:

>>> import numpy as np
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T
>>> x
array([[0, 1, 2],
       [2, 1, 0]])
# compute the covariance matrix of x0 and x1
>>> np.cov(x)
array([[ 1., -1.],
       [-1.,  1.]])
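
Note that np.cov divides by N-1 by default, while the high-dimensional derivation in section 6 below uses the 1/N convention. If the 1/N version is wanted, np.cov can produce it with the bias argument; a quick illustration on the same array (output shown as an approximation):

>>> np.cov(x, bias=True)  # divide by N instead of N-1
array([[ 0.66666667, -0.66666667],
       [-0.66666667,  0.66666667]])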

To run PCA on a dataset:

(1) First normalize the data and compute its covariance matrix.

(2) Then compute the eigenvalues and eigenvectors of the covariance matrix and sort them in descending order of eigenvalue.

(3) Project the data onto the chosen PCA subspace to obtain the projected vectors, and measure the mean squared error between the projections and the original data (see the formula below).
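
For step (3), reconstruction quality is usually measured by the mean squared error between each original point $x_n$ and its projection $\tilde{x}_n$: $\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$.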

1. Normalize the data.

import numpy as np

def normalize(X):
    """Normalize the given dataset X
    Args:
        X: ndarray, dataset
    
    Returns:
        (Xbar, mean, std): ndarray, Xbar is the normalized dataset
        with mean 0 and standard deviation 1; mean and std are the 
        mean and standard deviation respectively.
    
    Note:
        You will encounter dimensions where the standard deviation is
        zero; for those, the normalized data would be NaN. Handle this
        by using `std = 1` for those dimensions when normalizing.
    """
    mu = np.mean(X, axis=0)        # column-wise mean
    std = np.std(X, axis=0)        # column-wise standard deviation
    std_filled = std.copy()
    std_filled[std == 0] = 1.      # avoid division by zero for constant columns
    Xbar = (X - mu) / std_filled
    return Xbar, mu, std
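
A quick usage sketch (the sample array and printed values are illustrative, not from the original post; note the constant middle column):

X = np.array([[1., 2., 5.],
              [3., 2., 7.],
              [5., 2., 9.]])
Xbar, mu, std = normalize(X)
print(mu)    # [3. 2. 7.]
print(std)   # [1.63299316 0.         1.63299316]
print(Xbar)  # middle column stays all zeros because its std was 0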

2. np.argsort() 

x = np.array([3, 1, 2])
print(np.argsort(x))   # ascending order, returns indices
# output: [1 2 0]
# ========= descending order
print(np.argsort(-x))
# output: [0 2 1]

3. Compute the eigenvalues and eigenvectors

def eig(S):
    """Compute the eigenvalues and corresponding eigenvectors 
        for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors

    Note:
        the eigenvals and eigenvecs SHOULD BE sorted in descending
        order of the eigen values
        
        Hint: take a look at np.argsort for how to sort in numpy.
    """
    w, v = np.linalg.eigh(S)        # S is symmetric, so eigh is appropriate
    sorted_indices = np.argsort(-w) # indices that sort eigenvalues in descending order
    W = w[sorted_indices]           # reorder the eigenvalues
    V = v[:, sorted_indices]        # reorder the corresponding eigenvector columns
    return (W, V)

With the eigenpairs sorted this way, the PCA subspace (spanned by the top eigenvectors) follows directly.
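
As a quick sanity check, here is eig() applied to the 2x2 covariance matrix from the np.cov example at the top (printed values are approximate, and the eigenvector sign may differ):

S = np.array([[1., -1.],
              [-1., 1.]])
eig_vals, eig_vecs = eig(S)
print(eig_vals)        # approximately [2. 0.] -- sorted in descending order
print(eig_vecs[:, 0])  # first principal direction, proportional to [1, -1] up to sign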

4. Compute the projection matrix

def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    
    Returns:
        P: the projection matrix
    """
    # The eigenvectors returned by eigh have orthonormal columns (B.T @ B = I),
    # so the projection matrix B (B^T B)^{-1} B^T simplifies to B B^T.
    P = np.dot(B, B.T)
    return P
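
A small property check (a sketch using a hand-built orthonormal basis vector, not from the original post): a projection matrix must be symmetric and idempotent.

B = np.array([[1.0], [-1.0]]) / np.sqrt(2)   # one orthonormal basis vector in R^2
P = projection_matrix(B)
print(np.allclose(P, P.T))                   # True: P is symmetric
print(np.allclose(np.dot(P, P), P))          # True: P is idempotent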

5. The PCA algorithm

def PCA(X, num_components):
    """
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: ndarray of the reconstruction
        of X from the first `num_components` principal components.
    """
    # Compute the data covariance matrix S
    S = np.cov(X.T)
    
    # Find the eigenvalues and corresponding eigenvectors of S using eig(),
    # then keep only the top `num_components` of them.
    eig_vals, eig_vecs = eig(S)
    eig_vals = eig_vals[:num_components]
    eig_vecs = eig_vecs[:, :num_components]
    
    # Reconstruct the data from the lower-dimensional representation.
    # First build the projection matrix onto the subspace spanned by
    # the selected eigenvectors.
    P = projection_matrix(eig_vecs)
    
    # Then project every data point x_i (a row of X) onto that eigenbasis;
    # P is symmetric, so np.dot(X, P) projects the rows of X.
    X_reconstruct = np.dot(X, P)
    return X_reconstruct
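
A minimal usage sketch (random data, purely illustrative; the variable names are mine), computing the mean squared reconstruction error mentioned in step (3):

np.random.seed(0)
X_raw = np.random.randn(100, 5)              # 100 samples, 5 dimensions
Xbar, mu, std = normalize(X_raw)
X_rec = PCA(Xbar, num_components=2)
mse = np.mean(np.sum((Xbar - X_rec) ** 2, axis=1))
print(mse)                                   # decreases as num_components grows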

6. PCA for high-dimensional data (N < D)

Suppose the normalized data matrix is $X \in R^{N \times D}$ with $D > N$. The covariance matrix $S = \frac{1}{N} X^T X$ is $D \times D$, so its eigendecomposition is expensive; PCA can instead be carried out as follows.

(1) Solve the eigenvalue problem of the much smaller $N \times N$ matrix $\frac{1}{N} X X^T$ for the eigenpairs $\lambda_i, c_i$:

$\frac{1}{N} X X^T c_i = \lambda_i c_i$

(2) The original PCA problem asks for the eigenvectors $b_i$ of the covariance matrix: $\frac{1}{N} X^T X b_i = \lambda_i b_i$.

(3) Left-multiplying (1) by $X^T$ gives $\frac{1}{N} X^T X (X^T c_i) = \lambda_i (X^T c_i)$.

Hence $b_i = X^T c_i$ is an eigenvector of the covariance matrix $S$ with the same eigenvalue $\lambda_i$.
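
A small numerical check of this equivalence (a sketch with random data; it verifies that $X^T c_i$ is an eigenvector of $S = \frac{1}{N} X^T X$ with the same eigenvalue):

np.random.seed(1)
N, D = 5, 20                       # more dimensions than samples
X = np.random.randn(N, D)
S = np.dot(X.T, X) / N             # D x D covariance (1/N convention)
M = np.dot(X, X.T) / N             # N x N matrix
lam, C = eig(M)                    # eigenpairs of the small problem
b = np.dot(X.T, C[:, 0])           # candidate eigenvector of S
print(np.allclose(np.dot(S, b), lam[0] * b))   # True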

 

def PCA_high_dim(X, num_components):
    """Compute PCA for small sample size (N < D).
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of data points in the training set. You may assume the input 
           has been normalized.
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: (N, D) ndarray. the reconstruction
        of X from the first `num_components` principal components.
    """
    N, D = X.shape
    # Work with the smaller N x N matrix \frac{1}{N} X X^T instead of the
    # D x D covariance matrix.
    M = np.dot(X, X.T) / N
    eig_vals, eig_vecs = eig(M)
    # Keep the top eigenvectors c_i of M. Projecting onto their span and
    # left-multiplying X is equivalent to projecting the rows of X onto the
    # span of the original eigenvectors b_i = X^T c_i derived above.
    U = eig_vecs[:, :num_components]
    P = projection_matrix(U)       # N x N projection matrix
    X_reconstruct = np.dot(P, X)
    return X_reconstruct
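
A final consistency sketch (random data again): for normalized input, PCA and PCA_high_dim should produce essentially the same reconstruction. PCA above uses np.cov (which divides by N-1) while PCA_high_dim divides by N, but that scaling does not change the eigenvectors, so the projections agree.

np.random.seed(2)
X_raw = np.random.randn(8, 30)         # N=8 samples, D=30 dimensions
Xbar, mu, std = normalize(X_raw)
r1 = PCA(Xbar, num_components=3)
r2 = PCA_high_dim(Xbar, num_components=3)
print(np.allclose(r1, r2))             # True (up to numerical precision)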

 






 
