Dimensionality Reduction


Principal Component Analysis, PCA

Preprocess the data by zero-centering (subtracting the mean).

Then perform a singular value decomposition (SVD), or an eigendecomposition, of the covariance matrix.

Let the zero-centered data matrix be \(X\in \mathbb R^{m\times n}\) (\(m\) samples, \(n\) features).

\[\begin{aligned} &\Sigma=X^TX && \text{covariance matrix (up to a factor of } 1/m\text{)} \\ &[U,S,V]=\mathrm{svd}(\Sigma) && \text{SVD yields the eigenvectors } U \\ &P=U(:, 1:K) && \text{principal components: the first } K \text{ columns} \\ &\text{variance loss ratio}=\frac{\frac1m \sum_i^m \|\bm x^{(i)}-\bm x^{(i)}_{approx}\|^2}{\frac1m \sum_i^m\|\bm x^{(i)}\|^2} = 1 - \frac{\sum_i^k S_{ii}}{\sum_i^n S_{ii}} \end{aligned}\]

The reduced-dimensional data are \(XP\); the reduced data can be approximately reconstructed as \((XP)P^T\).

PCA works well when the structure of the data is (approximately) linear.
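
A minimal NumPy sketch of the steps above (zero-centering, SVD of the covariance, projecting onto the first \(K\) columns, reconstruction, variance loss); the function name `pca_reduce` is just for illustration:

```python
import numpy as np

def pca_reduce(X, K):
    """PCA on an (m x n) data matrix X, keeping the first K principal components."""
    X = X - X.mean(axis=0)              # zero-mean preprocessing
    Sigma = X.T @ X / X.shape[0]        # covariance matrix (n x n)
    U, S, _ = np.linalg.svd(Sigma)      # eigenvectors are the columns of U
    P = U[:, :K]                        # principal components: first K columns
    Z = X @ P                           # reduced data XP, shape (m, K)
    X_approx = Z @ P.T                  # reconstruction (XP)P^T, back to n dimensions
    loss = 1 - S[:K].sum() / S.sum()    # variance loss ratio
    return Z, X_approx, loss
```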

kPCA (kernel PCA)

A method that combines PCA with the kernel trick.
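
For example, scikit-learn provides an implementation; the random `X` and the RBF kernel below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.random.rand(100, 10)                                          # placeholder data: 100 samples, 10 features
X_kpca = KernelPCA(n_components=2, kernel='rbf').fit_transform(X)    # kernel trick + PCA, down to 2-D
```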

Non-negative Matrix Factorization, NMF

Like PCA, except that the coefficients in the linear combination must be non-negative.

It converges to a local minimum (so the starting point of the optimization matters).
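
A small scikit-learn sketch, assuming `X` is a non-negative data matrix; the fixed `random_state` is there only because the result depends on the initialization:

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(100, 10)                      # placeholder non-negative data
nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(X)                         # non-negative coefficients (reduced representation)
H = nmf.components_                              # non-negative basis components, X ≈ W @ H
```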

Random Projection

The original \(d\)-dimensional data are projected onto a \(k\)-dimensional subspace (\(k\ll d\)).

\(X\in\mathbb R^{d\times N}\): original matrix of \(N\) samples with \(d\) features
\(R\in\mathbb R^{k\times d}\): random transformation matrix
The resulting matrix in the lower-dimensional space is \(RX\in\mathbb R^{k\times N}\).

Generating the random matrix: for a Gaussian random projection, the rows of \(R\) are drawn from a Gaussian distribution, scaled to unit length, and are (approximately) orthogonal to each other.
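
A minimal NumPy sketch under the notation above (the sizes are arbitrary; scikit-learn's `sklearn.random_projection.GaussianRandomProjection` offers a ready-made alternative, though it expects samples as rows):

```python
import numpy as np

d, N, k = 1000, 500, 50                          # illustrative sizes
X = np.random.randn(d, N)                        # original data: d features x N samples

R = np.random.randn(k, d)                        # Gaussian random matrix
R /= np.linalg.norm(R, axis=1, keepdims=True)    # scale each row to unit length
                                                 # (rows are nearly orthogonal when d is large)
X_low = R @ X                                    # projected data: k x N
```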

Factor Analysis

Assume there is a set of unobservable latent factors \(z_j\ (j=1,\dots,k)\).
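
A quick scikit-learn sketch, assuming `X` is an \(N\times d\) data array and `k` latent factors (the data here are random placeholders):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.randn(200, 10)                           # placeholder data: 200 samples, 10 features
k = 3                                                  # number of latent factors
Z = FactorAnalysis(n_components=k).fit_transform(X)    # latent factor scores, shape (200, k)
```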

Multi-Dimensional Scaling, MDS

Typical setting: the pairwise distances \(d_{ij}\) between \(N\) points are known, but the coordinates of the points, their dimensionality, and the distance metric used to compute \(d_{ij}\) are all unknown. MDS maps such points into a low-dimensional space.
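
A sketch with scikit-learn, assuming `D` is the \(N\times N\) matrix of known pairwise distances \(d_{ij}\) (the hidden coordinates below exist only to fabricate a valid `D`):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

points = np.random.rand(50, 5)                   # hidden coordinates (unknown in the real scenario)
D = squareform(pdist(points))                    # the only available input: N x N pairwise distances
Y = MDS(n_components=2, dissimilarity='precomputed').fit_transform(D)   # 2-D embedding
```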

Linear Discriminant Analysis, LDA

Take the number of classes \(K=2\) (binary classification) as an example, with class labels \(C_1, C_2\). The expression \(z=\bm w^T \bm x\) is the projection of \(\bm x\) onto \(\bm w\). Let \(\bm m_1\) and \(m_1\) be the means of the class-\(C_1\) samples before and after projection, so \(\bm m_1\in \mathbb R^d,\ m_1\in\mathbb R\) (\(d\) is the feature dimension, and likewise \(\bm w\in\mathbb R^d\)). Denote the sample set by \(\mathcal X=\{\bm x^{(t)}, y^{(t)}\}\), where the label \(y^{(t)}=1\) means sample \(\bm x^{(t)}\) belongs to class \(C_1\) and \(y^{(t)}=0\) means it belongs to class \(C_2\). Then

\[m_1 = \frac{\sum_t \bm w^T \bm x^{(t)} y^{(t)}}{\sum_t y^{(t)}}=\bm w^T\bm m_1 \\ m_2 = \frac{\sum_t \bm w^T \bm x^{(t)} (1-y^{(t)})}{\sum_t (1-y^{(t)})}=\bm w^T \bm m_2 \]

The scatter of the projected samples from \(C_1\) and \(C_2\) is

\[s_1^2 = \sum_t(\bm w^T \bm x^{(t)} - m_1)^2\, y^{(t)} \\ s_2^2 = \sum_t(\bm w^T \bm x^{(t)} - m_2)^2\, (1-y^{(t)}) \]

For the classes to be well separated after projection, we want the class means to be as far apart as possible and the samples of each class to be scattered over as small a region as possible; that is, \(|m_1-m_2|\) should be large while \(s_1^2+s_2^2\) should be small. Fisher's linear discriminant finds the \(\bm w\) that maximizes

\[J(\bm w)=\frac{(m_1 - m_2)^2}{s_1^2+s_2^2} \]
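
For reference (not spelled out in the original notes), the maximizer has the standard closed-form solution in terms of the within-class scatter matrix \(\bm S_W\):

\[\bm S_W = \sum_t y^{(t)}(\bm x^{(t)}-\bm m_1)(\bm x^{(t)}-\bm m_1)^T + \sum_t (1-y^{(t)})(\bm x^{(t)}-\bm m_2)(\bm x^{(t)}-\bm m_2)^T, \qquad \bm w \propto \bm S_W^{-1}(\bm m_1-\bm m_2)\]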

Isometric Feature Mapping (Isomap)

Distances are computed along a manifold in the feature space (e.g., the manifold formed by a burst of photos taken while a face rotates).
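
A quick scikit-learn sketch (the random `X` and the neighborhood size are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(200, 20)                                   # placeholder data: 200 samples, 20 features
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # along-manifold distances via a k-NN graph
```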

Locally Linear Embedding, LLE

Sammon Mapping

(non-linear) (Sammon, 1969; see also Lee and Verleysen, 2007)

……

Curvilinear Components Analysis, CCA

(non-linear) (by Demartines and Herault, 1997)

Stochastic Neighbor Embedding, SNE

(non-linear) (by Hinton and Roweis, 2002)

t-distributed Stochastic Neighbor Embedding, t-SNE


(by L. van der Maaten & G. Hinton, 2008)

(The paper contains many ideas about handling practical data issues that are worth studying closely.)

It's a non-linear dimensionality reduction algorithm.

\(N\) high-dimensional data points \(\bm x_1,...,\bm x_N\), and their low-dimensional counterparts \(\bm Y=[\bm y_1\ ...\ \bm y_N]^T\).

conditional probability for \(j\ne i\) :

\[p_{j|i} := \frac{\exp(-\|\bm x_i-\bm x_j\|^2/\sigma_i^2)}{\sum_{k \ne i}\exp(-\|\bm x_i-\bm x_k\|^2/\sigma_i^2)} \\ q_{j|i} := \frac{\exp (-\| \bm y_i -\bm y_j\|^2)}{\sum_{k\ne i} \exp (-\| \bm y_i - \bm y_k \|^2)} \]

\[p_{i|i} := 0 \\ q_{i|i} := 0 \]

Note that \(\sum_j p_{j|i}=1, \sum_j q_{j|i}=1, \forall i\) .

Loss function: the sum of the Kullback-Leibler divergences of the conditional probabilities over all data points:

\[J(\bm Y)=\sum_i \mathrm{KL}(P_i \| Q_i)=\sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{ q_{j|i}} \]
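
A small NumPy sketch of these quantities (the helper names `conditional_probs` and `kl_loss` are illustrative; `D2` is the \(N\times N\) matrix of squared pairwise distances and `sigmas` holds the per-point bandwidths \(\sigma_i\)):

```python
import numpy as np

def conditional_probs(D2, sigmas):
    """p_{j|i} from squared pairwise distances D2 (N x N) and per-point bandwidths sigmas (N,)."""
    P = np.exp(-D2 / sigmas[:, None] ** 2)
    np.fill_diagonal(P, 0.0)                     # p_{i|i} := 0
    return P / P.sum(axis=1, keepdims=True)      # each row sums to 1

def kl_loss(P, Q, eps=1e-12):
    """J(Y) = sum_i KL(P_i || Q_i), summed over all matrix entries."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

Applying `conditional_probs` to the squared distances between the \(\bm y_i\) with all \(\sigma_i = 1\) gives \(q_{j|i}\).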

An alternative loss function is a single KL divergence on joint probabilities (again minimized over \(\bm Y\)):

\[J(\bm Y)=\mathrm{KL}(P \| Q) = \sum_{i\ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]

Symmetric SNE:

joint probabilities:

\[p_{ij} := \frac{\exp(-\|\bm x_i-\bm x_j\|^2/\sigma^2)}{\sum_{k \ne l}\exp(-\|\bm x_k-\bm x_l\|^2/\sigma^2)} \\ q_{ij} := \frac{\exp (-\| \bm y_i -\bm y_j\|^2)}{\sum_{k\ne l} \exp (-\| \bm y_k - \bm y_l \|^2)} \]

But the above \(p_{ij}\) causes problems when a high-dimensional data point \(\bm x_i\) is an outlier (i.e., all pairwise distances \(\|\bm x_i - \bm x_j\|^2\) are large for \(\bm x_i\)). For such an outlier, the values of \(p_{ij}\) are extremely small for all \(j\), so the location of its low-dimensional map point \(\bm y_i\) has very little effect on the cost function. As a result, the position of the map point is not well determined by the positions of the other map points.

Instead, the joint probabilities are defined as the symmetrized conditional probabilities:

\[p_{ij} := \frac{p_{i|j}+p_{j|i}}{2N} \]

which ensures that \(\sum_j p_{ij}> \frac{1}{2N}\) for all data points \(\bm x_i\), as a result of which each data point \(\bm x_i\) makes a significant contribution to the cost function.

t-SNE:

\[q_{ij} := \frac{(1+\| \bm y_i - \bm y_j \|^2)^{-1}}{\sum_k\sum_{l\ne k}(1+ \| \bm y_k - \bm y_l \|^2)^{-1}} \]

\[q_{ii} := 0 \]
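
For reference, the gradient of the cost with respect to each map point (as derived in the t-SNE paper) is:

\[\frac{\partial J}{\partial \bm y_i} = 4\sum_j (p_{ij}-q_{ij})(\bm y_i-\bm y_j)\left(1+\|\bm y_i-\bm y_j\|^2\right)^{-1}\]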

Python tool:

```python
# sklearn computes on the CPU; it can be slow for large datasets
from sklearn.manifold import TSNE
X_2d = TSNE(n_components=2).fit_transform(X)   # reduce to 2 dimensions

# The `cuML` library computes on the GPU and is much faster:
# https://docs.rapids.ai/api/cuml/stable/api.html#tsne
```

Independent Component Analysis, ICA

Linear independent component analysis, linear ICA

general definition of linear independent component analysis:

linear noiseless ICA

linear noisy ICA
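
In the standard linear ICA formulation (stated here as a reminder, since the notes leave these as stubs), the observed data \(\bm x\) are modeled as a mixture of statistically independent sources \(\bm s\) through a mixing matrix \(A\), with an optional noise term \(\bm n\):

\[\text{noiseless: } \bm x = A\bm s \qquad\qquad \text{noisy: } \bm x = A\bm s + \bm n\]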
