Chapter 5: Unsupervised Learning

Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".

Principal component analysis (PCA)

The direction keeping more variance is more important.

The variance along a direction \(\mathbf{v}\) is defined as follows (assuming the data are centered):

\[E_{\mathbf{x}_i} \left[ \langle \mathbf{v}, \mathbf{x}_i \rangle^2 \right]= \frac{1}{n} \sum_{i=1}^{n} (\mathbf{v}^{\top} \mathbf{x}_i)^2 = \frac{1}{n} \mathbf{v}^{\top}\mathbf{X}\mathbf{X}^{\top} \mathbf{v} \]

Here $ \mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n} $.

Our objective is

\[\max_{\mathbf{v}} \mathbf{v}^{\top} \mathbf{XX}^{\top} \mathbf{v} \quad \text{s.t. } \mathbf{v}^{\top} \mathbf{v} = 1 \]

So we only need to find the eigenvector corresponding to the largest eigenvalue of \(\mathbf{XX}^{\top}\).

Relationship to SVD

For any \(X \in \mathbb{R}^{d \times n}\), we have \(X = U\Sigma V^{\top}\). Therefore, \(XX^{\top} = U\Sigma V^{\top}V\Sigma U^{\top} = U \Sigma^2 U^{\top}\).

The diagonal entries of \(\Sigma^2\) are the eigenvalues of \(XX^{\top}\).

Power method

Let \(b_0\) be a random vector in \(\mathbb{R}^d\). Write \(b_0 = a_1 v_1 + a_2 v_2 + \dots + a_d v_d\), where \(v_1, \ldots, v_d\) are the eigenvectors of \(XX^{\top}\).

\[\begin{align*} b_{t+1} = \frac{(XX^{\top}) b_t}{\|XX^{\top} b_t\|}, \qquad b_{t} = \frac{1}{C_t} (\lambda_1^{t} a_1 v_1 + \dots + \lambda_d^{t} a_d v_d) \end{align*} \]

So the \(v_1\) component grows exponentially faster than the \(v_i\) components for \(i \geq 2\). After a large enough number of iterations \(T\), \(b_T \approx v_1\), the eigenvector corresponding to the largest eigenvalue.

Once we know \(v_1\), we can remove the \(v_1\) component (deflation) and then find the eigenvector of the second largest eigenvalue, and so on…
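A minimal numpy sketch of the power method with deflation (function and variable names are my own):

```python
import numpy as np

def power_method(M, num_iters=1000):
    """Approximate the top eigenvalue/eigenvector of a symmetric PSD matrix M."""
    b = np.random.randn(M.shape[0])
    for _ in range(num_iters):
        b = M @ b
        b /= np.linalg.norm(b)          # keep b normalized so it does not blow up
    return b @ M @ b, b                 # Rayleigh quotient gives the eigenvalue

# Example: top principal direction of centered data X (d x n)
X = np.random.randn(5, 100)
X -= X.mean(axis=1, keepdims=True)
M = X @ X.T
lam1, v1 = power_method(M)

# Deflation: remove the v1 component, then the same routine finds v2
lam2, v2 = power_method(M - lam1 * np.outer(v1, v1))
```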

Alternative interpretation

We can also derive PCA by minimizing the reconstruction error:

\[\min \left[ \frac{1}{n} \sum_{i=1}^{n} \lVert \mathbf{x}_i - (\mathbf{v}^{\top} \mathbf{x}_i) \mathbf{v} \rVert^2 \right] \]

Once we know the decomposition, we can pick \(k<d\) and project the data points onto the subspace spanned by \(v_1, v_2, \cdots, v_k\).
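A short sketch of this projection step via the SVD, assuming the columns of \(X\) are already centered (names are illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Project the columns of X (d x n, centered) onto the top-k principal directions."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
    V_k = U[:, :k]        # top-k left singular vectors = top-k eigenvectors of X X^T
    return V_k.T @ X      # k x n matrix of low-dimensional coordinates

X = np.random.randn(10, 200)
X -= X.mean(axis=1, keepdims=True)
Z = pca_project(X, k=3)
```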

Nearest neighbor

KD tree

If \(d\) is small, we can use a k-d tree.

Suppose there are \(N\gg1\) data points in \(R^d\); call this set \(P\). Given a query point \(q\in R^d\), we want to find the nearest point \(p\in P\) minimizing \(\|q−p\|\) (under any norm).

However, requiring the dimension to be small is a rather restrictive condition that we would rather not rely on.

Exact nearest neighbor search is hard. Usually, we don't need an exact answer; an approximate NN search suffices.

Relaxation from exact to approximate: R-near neighbor problem

Find \(p\in P\) such that \(\|p−q\| \leq R\).

Randomized \(\mathcal{c}\)-approximate \(\mathcal{R}\)-near neighbor, or \((\mathcal{c},\mathcal{R})\)-\(\mathcal{NN}\).

Given a set \(P\) of points in a \(d\)-dimensional space \(\mathbb{R}^d\), and parameters \(\mathcal{R}>0\), \(\delta>0\), construct a data structure s.t. given any query point \(q\), if there exists an \(\mathcal{R}\)-near neighbor of \(q\) in \(P\), it reports some \(\mathcal{c}\mathcal{R}\)-near neighbor of \(q\) in \(P\) with probability \(1-\delta\).

Locality-sensitive hashing (LSH)

A family \(\mathcal{H}\) is called \((\mathcal{R}, \mathcal{cR}, \mathcal{P}_1, \mathcal{P}_2)\)-sensitive if for any two points \(p, q \in \mathbb{R}^d\):

\[\begin{align*} \text{If } \|p - q\| &\leq \mathcal{R} \text{, then }\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \geq \mathcal{P}_1\\ \text{If } \|p - q\| &\geq \mathcal{cR} \text{, then }\Pr_{h \in \mathcal{H}}[h(q) = h(p)] \leq \mathcal{P}_2 \end{align*} \]

For a locality-sensitive hash (LSH) family to be useful, it has to satisfy \(\mathcal{P}_1 > \mathcal{P}_2\).

After mapping by \(h\), close points collide (hash to the same value) while far points do not. If \(\mathcal{P}_1 = 1, \mathcal{P}_2=0\), we are done.

So we want to increase the gap between \(\mathcal{P}_1\) and \(\mathcal{P}_2\)

Table Construction

An \(L \times k\) matrix, where each element is a function \(h_{i,j}\) randomly chosen from the Locality-Sensitive Hashing (LSH) family.

There are also \(L\) hash tables. Each \(g_i\) is the \(i\)-th row of the matrix, containing \(k\) functions: \(g_i = (h_{i,1}, h_{i,2}, \ldots, h_{i,k})\).

Query Algorithm for Point \(q\)

Go through all \(L\) hash tables (a code sketch of the whole procedure follows the list):

  • Compute \(g_j(q)\) (a vector of length \(k\)), which means applying each function of the \(j\)-th hash table to the query point \(q\).
  • The result \(g_j(q)\) determines which "bucket" of the \(j\)-th hash table \(q\) falls into. A bucket is a data structure storing all points with the same hash result.
  • Retrieve all points stored in that bucket. For each retrieved point:
    • Compute its distance to the query point \(q\); if the distance is at most \(cR\), regard it as a candidate \(cR\)-NN.
    • Stop after examining more than \(L' = 2L+1\) points.
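A rough sketch of the table construction and query procedure, parameterized by an abstract LSH family; `hash_family` and the other names are my own, not from the course:

```python
import numpy as np
from collections import defaultdict

def build_tables(points, hash_family, L, k):
    """points: array of d-dimensional vectors; hash_family() returns a random hash h(x) -> int."""
    G = [[hash_family() for _ in range(k)] for _ in range(L)]   # L x k matrix of hash functions
    tables = [defaultdict(list) for _ in range(L)]
    for idx, x in enumerate(points):
        for i in range(L):
            key = tuple(h(x) for h in G[i])    # g_i(x): concatenation of k hash values
            tables[i][key].append(idx)
    return G, tables

def query(q, points, G, tables, c, R):
    L = len(G)
    checked = 0
    for i in range(L):
        key = tuple(h(q) for h in G[i])        # bucket of q in the i-th table
        for idx in tables[i].get(key, []):
            if np.linalg.norm(points[idx] - q) <= c * R:
                return idx                     # report a cR-near neighbor
            checked += 1
            if checked > 2 * L + 1:            # stop after examining L' = 2L + 1 points
                return None
    return None
```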

Theorem

If there exists \(p^* \in \mathcal{B}(q, R)\) (the ball of radius \(R\) centered at \(q\)), we will find a point that is \(cR\)-near to \(q\) with probability at least \(1/2 - 1/e\).

Define event A: some \(g_i\) maps the \(R\)-near point \(p^*\) and \(q\) into the same bucket. Define event B: among the retrieved points we do not see too many that are not \(cR\)-near. Once we bound \(P(\bar{A})\) and \(P(\bar{B})\), a union bound over the complements and taking the complement gives the claim.

Proof:

Step 1: bound \(P(\bar{B})\).

Set \(\rho = \frac{\log\left(\frac{1}{P_1}\right)}{\log\left(\frac{1}{P_2}\right)} = \frac{\log{P_1}}{\log{P_2}}, k = \log_{1/P_2}(n) = \log_{P_2}(\frac{1}{n}), L = n^{\rho}\), where \(n\) is the total number of data points.

For a given \(g_i\) and any point \(p'\) that is not \(cR\)-near to \(q\) (i.e. \(p' \notin \mathcal{B}(q, cR)\)), the probability that \(g_i(p') = g_i(q)\) is \(\le P_2^k = \frac 1 n\).

This is because each hash function collides on such a pair with probability at most \(P_2\), so all \(k\) functions collide with probability at most \(P_2^k\).

So for one \(g_i\), the expected number of such far points colliding with \(q\) is \(\le 1\) (there are \(n\) points in total, each contributing at most \(1/n\)). There are \(L\) functions \(g_i\) in total, so the total expectation is \(\le L\).

By Markov's inequality, the probability that the total number of such far points is \(\ge 2L\) is \(\le \frac 1 2\). This is the probability of examining \(2L\) points from the buckets without finding a \(cR\)-near point, i.e. \(P(\bar{B}) \le \frac 1 2\).

Step 2: bound \(P(\bar{A})\).

Now consider the \(R\)-near point \(p^*\) mentioned above.

\(\Pr[g_i(p^*) = g_i(q)] \geq P_1^k = P_1^{\log_{P_2}(\frac{1}{n})} = (\frac{1}{n})^{\frac{\log{P_1}}{\log{P_2}}} = n^{-\rho} = \frac 1 L\) (by the LSH condition; one step uses the change-of-base formula).

\(\Pr\left[\left(g_i(p^*) \neq g_i(q)\right), \forall i\right] \leq \Big(1 - \Pr[g_i(p^*) = g_i(q)]\Big)^L \le \left(1 - \frac 1 L\right)^{L} \leq \frac{1}{e}\). This bounds the probability that no such \(g_i\) exists, i.e. \(P(\bar{A}) \le \frac 1 e\).

So the probability that both events occur is \(\ge 1 - \frac 1 2 - \frac 1 e = \frac 1 2 - \frac 1 e\).

With this theorem in hand, we can also see why the LSH definition only specifies the collision probability when an \(R\)-near point exists: as long as one \(R\)-near point exists, we obtain a fixed lower bound on the probability of finding a \(cR\)-near point.

LSH example

\(h_{(r,b)}(x) = \left\lfloor \frac{\langle r, x \rangle + b}{w} \right\rfloor\)

Explanation:

  • \(w\) is the unit length, a hyperparameter
  • \(r\in \mathbb{R}^d \sim \text{Gaussian}\)
  • \(b \sim \text{uniformly}[0, w)\)
  • This formula projects all the vectors onto a line and splits the line into buckets of width \(w\) (a code sketch follows this list).
    If \(\langle r, p \rangle\) and \(\langle r, q \rangle\) are in the same bucket, the hash values are equal; that is, their difference before the \(\lfloor \cdot \rfloor\) operation lies within \([-1, 1]\).
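A minimal sketch of one hash function from this family (Gaussian projection, floor into buckets of width \(w\)); names are my own:

```python
import numpy as np

def make_hash(d, w):
    """Sample h_{(r,b)}: r ~ Gaussian, b ~ Uniform[0, w)."""
    r = np.random.randn(d)
    b = np.random.uniform(0.0, w)
    return lambda x: int(np.floor((r @ x + b) / w))

h = make_hash(d=16, w=1.0)
p = np.random.randn(16)
q = p + 0.01 * np.random.randn(16)   # a nearby point
print(h(p), h(q))                    # nearby points usually land in the same bucket
```

Such a `make_hash` could also serve as the `hash_family` argument of the table-construction sketch above.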

Given two points p, q

  • \(\langle r, p \rangle, \langle r, q \rangle, \langle r, p - q \rangle\) all follow Gaussian distribution.

Simple case: \(w = 1\)

For example, let \(w = 1\), \(\|p - q\|_2 = 1\); the probability that they land in the same bucket is given by

\[P(c) = \Pr\left[h_{(r, b)}(p) = h_{(r, b)}(q)\right] =2 \int_0^1 f_p(t)(1 - t) \, dt \]

Explanation:

  • Integrating (the probability density that \(\langle r, p - q \rangle=t\)) times (the probability that \(p, q\) fall in the same bucket given \(\langle r, p - q \rangle=t\)) over \(t\) gives the probability that \(p, q\) land in the same bucket. Here \(t\) ranges from 0 to 1; the cases \([-1, 0]\) and \([0, 1]\) are symmetric, hence the factor of 2.
  • For the first factor: because \(\langle r, p - q \rangle\) follows a Gaussian distribution, its density is \(\Pr(\langle r, p - q \rangle = t) =f_p(t)\).
  • For the second factor: given \(\langle r, p - q \rangle= t\), what is the probability that \(p, q\) are in the same bucket? It is the probability that \(\langle r, p\rangle + b \in [0, 1-t]\) (some details are glossed over here, but the interval length is certainly \(1-t\)). So the probability is \(\Pr\big[\langle r, p\rangle + b \in [0, 1-t]\big] = 1-t\).

Scaled case for \(w\) and \(\|p-q\|_2\)

Let \(c = \|p - q\|_2\)

The probability \(P(c)\), which is the probability that \(h_{(r, b)}(p) = h_{(r, b)}(q)\), is given by

\[P(c) = \Pr\left[h_{(r, b)}(p) = h_{(r, b)}(q)\right] = 2\int_0^w \frac{1}{c} f_p\left(\frac{t}{c}\right)\left(1 - \frac{t}{w}\right) \, dt \]

Explanation:

  • Because \( \frac{\langle r, p - q \rangle}{\|p-q\|_2} =\frac{\langle r, p - q \rangle}{c} \) follows a standard Gaussian distribution, the density of \(\langle r, p - q \rangle\) at \(t\) is \(\frac 1 c f_p(\frac t c)\).

    • Note that the density is not simply \(f_p\left(\frac{t}{c}\right)\): one can check that \( f_p\left(\frac{t}{c}\right)\) integrates to \(c\) over the whole real line rather than 1, so we normalize and get \(\frac 1 c f_p(\frac t c)\). The integral is:

    • \[\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(\frac{x}{c}\right)^2}{2}\right) \, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) c \, du = c \]

  • \(t\) now ranges from 0 to \(w\).

nearest neighbor classification

k nearest neighbor

This is a non-parametric model: all of its predictions depend directly on the data.
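A tiny numpy sketch of k-NN classification (names are my own):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority label
```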

Metric Learning

We want to learn a projection function \(f(x)\) s.t. in the projected feature space, each point has the same label as its nearest neighbor (i.e., we project the raw data into a space that is easier to classify in).

But we do not have gradient information (the nearest-neighbor rule is discrete)! What can we do? Make the hard assignment soft.

NCA algorithm (Neighborhood Component Analysis)

\[\begin{align*} p_{i,j} = \frac{e^{-\|f(x_i) - f(x_j)\|^2}}{\sum_{k \neq i} e^{-\|f(x_i) - f(x_k)\|^2}}\\ C_i = \{j \mid c_i = c_j\}\\ P_i = \sum_{j \in C_i} p_{ij}\text{ (each point is summed against all points of its own class)}\\ L(A) = \sum_i \left(\sum_{j \in C_i} p_{ij}\right) = \sum_i P_i\text{ (the objective over all points)} \end{align*} \]
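A rough numpy sketch of the NCA objective for a linear map \(f(x) = Ax\) (the gradient step is omitted; names are my own):

```python
import numpy as np

def nca_objective(A, X, y):
    """NCA objective sum_i P_i for a linear projection f(x) = A @ x.
    X: n x d data, y: n integer labels, A: k x d projection matrix."""
    Z = X @ A.T                                              # projected points, n x k
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                             # exclude j = i (exp(-inf) = 0)
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                        # p_{ij}
    same_class = (y[:, None] == y[None, :])                  # j in C_i
    return (P * same_class).sum()                            # sum_i P_i (to be maximized)
```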

LMNN algorithm

\(L_{\text{rank}} = \max(0, \|f(x) - f(x^+)\|_2 - \|f(x) - f(x^-)\|_2 + r)\)

We want the distance to the negative sample to be as large as possible, and the distance to the positive sample to be as small as possible.

In this setting, the positive sample \(x^+\) and negative sample \(x^-\) are chosen to create a worst-case contrast, which encourages the model to classify samples better. This prevents the loss from trivially dropping to zero and makes training more robust.

Subjectively, this feels not as good as cross entropy.
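A minimal sketch of this hinge-style ranking loss on a single triplet of embeddings (names are my own):

```python
import numpy as np

def ranking_loss(f_x, f_pos, f_neg, r=1.0):
    """max(0, ||f(x) - f(x+)|| - ||f(x) - f(x-)|| + r) for one (anchor, positive, negative) triplet."""
    d_pos = np.linalg.norm(f_x - f_pos)   # distance to the positive sample
    d_neg = np.linalg.norm(f_x - f_neg)   # distance to the negative sample
    return max(0.0, d_pos - d_neg + r)
```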

Clustering

A very important concept in unsupervised learning.

It aims to group objects so that objects in the same group are similar to each other. But there is no unique solution! It depends on the loss function.

K-means algorithm

Given \((x_1, x_2, \ldots, x_n)\), \(x_i \in \mathbb{R}^d\), we want to partition them into \(k\) sets \(\{S_1, S_2, \ldots, S_k\}\) to minimize the within-cluster sum of squares.

\(\arg\min_{S} \sum_{i=1}^k \sum_{x \in S_i} \|x - \mu_i\|^2\) where \(\mu_i\) is the mean of points in \(S_i\).

In general, NP-hard to find the optimal solution. One heuristic algorithm is Lloyd’s algorithm.

The basic steps of the algorithm are as follows:

  1. Initialization: randomly choose \(k\) data points as the initial cluster centers.
  2. Assignment: for each data point, compute its distance to each cluster center and assign it to the nearest cluster.
  3. Center update: for each cluster, compute the mean of all its members and use it as the new cluster center.
  4. Iterate: repeat steps 2 and 3 until the cluster centers no longer change significantly, or a preset number of iterations is reached.

Lloyd's algorithm alternates assignment and update steps to optimize the cluster centers, eventually partitioning the data points into relatively compact, nearby clusters. It is a widely used clustering algorithm in data mining and pattern recognition.
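A minimal numpy sketch of Lloyd's algorithm following the four steps above (names are my own):

```python
import numpy as np

def lloyd_kmeans(X, k, num_iters=100, seed=0):
    """X: n x d data. Returns (centers, cluster assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # 1. random initial centers
    for _ in range(num_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                               # 2. assign to nearest center
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)]) # 3. recompute the means
        if np.allclose(new_centers, centers):                       # 4. stop when centers stabilize
            break
        centers = new_centers
    return centers, assign
```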

Why does it terminate? The within-cluster objective never increases after an assignment or update step, and there are only finitely many partitions.

Theoretically, it may take \(2^{\Omega(\sqrt{n})}\) iterations, but empirically very fast.

Limitations:

  • It may converge to a bad local minimum. This bad performance can happen even with well-separated Gaussian clusters.
  • Sometimes, the \(\ell_p\) distance is not necessarily the best metric. For example, distributions like concentric circles cannot be handled this way.

The first issue is unlikely in practice; the second is addressed by spectral graph clustering.

Spectral graph clustering

In general, we hope to define “similarity” between points.

There are many ways to define edges in the graph. Then we cluster using the graph Laplacian.

To find \(k\) clusters using the Laplacian, follow these steps (a code sketch follows the list):

  • Compute the first \(k\) eigenvectors \(\mu_1, \ldots, \mu_k\) of \(L\),
    • corresponding to the \(k\) smallest eigenvalues (close to 0, not necessarily \(= 0\)).
  • Let \(U \in \mathbb{R}^{n \times k}\) be the matrix containing vectors \(\mu_1, \ldots, \mu_k\) as columns.
  • For \(i = 1, \ldots, n\), let \(y_i \in \mathbb{R}^k\) be the \(i\)-th row of \(U\); the points \(\{y_i\}_{i=1}^{n}\) in \(\mathbb{R}^k\) are the ones we cluster.
  • Run \(k\)-means for \(\{y_i\}\), get \(C_1, \ldots, C_k\).
  • Output \(A_1, \ldots, A_k\) with \(A_i = \{j \mid y_j \in C_i\}\).
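A short sketch of this procedure, assuming a precomputed similarity matrix \(W\) and using scikit-learn's KMeans for the final step (names are my own):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """W: n x n symmetric similarity (adjacency) matrix. Returns a cluster label per node."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigh sorts eigenvalues in ascending order
    U = eigvecs[:, :k]                       # eigenvectors of the k smallest eigenvalues
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)   # k-means on the rows y_i
    return labels
```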

why does this make sense?

Example 1: ideal case

Suppose we have a graph consisting of \(k\) connected components: no edges between different components, and unweighted edges within each component.

If we look for the \(k\) smallest eigenvalues, they are all 0, and the corresponding eigenvectors are constant within each component.

That is:

\[L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & \ddots \\ \end{bmatrix} \]

The corresponding \(k\) eigenvectors are: all 1's in the positions of the corresponding block and 0 elsewhere (normalized, of course).

Stacking these vectors as columns gives:

\[U = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ \end{bmatrix} \]

Running k-means on the rows then separates the blocks one by one. The 1's in the first four rows come from the eigenvector of \(L_1\), meaning \(L_1\) has four nodes.

However, in general \(G\) is connected, and we cannot simply find such nicely isolated components!

Example 2: ratio cut

Graph cut problem:
\(\text{mincut}(A_1, \ldots, A_k) = \frac{1}{2}\sum_{i=1}^{k} W(A_i, \bar{A_i})\)

\(W(A_i, \bar{A_i})\) means the total weights between \(A_i\) and the rest.

However, this definition may have trivial solutions: for example, each of \(A_2, A_3, \ldots, A_k\) may contain only a single node! So instead, look at the ratio cut:

\(\text{ratiocut}(A_1, \ldots, A_k) = \frac{1}{2}\sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|}\)

So if \(|A_i|\) is small, this term is large. Minimizing ratiocut is NP-hard.

Consider \(\min_A \text{ratiocut}(A, \bar{A})\)

Assume \(G\) is connected; the all-ones vector is the eigenvector corresponding to the smallest eigenvalue (which is 0).

Define \(v^A = (v_1, v_2, \ldots, v_n) \in \mathbb{R}^n\) such that:

\[v_i^A = \begin{cases} \sqrt{\frac{|\bar{A}|}{|A|}}, & \text{if } i \in A \\ -\sqrt{\frac{|A|}{|\bar{A}|}}, & \text{otherwise} \end{cases} \]

\[\begin{align*} (v^A)^{\top}Lv^A &= \frac{1}{2}\sum_{ij} w_{ij}(v_i^A - v_j^A)^2 \\ &= \frac{1}{2}\sum_{i \in A, j \in \bar{A}} w_{ij} \left(\sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}}\right)^2 + \frac{1}{2}\sum_{j \in A, i \in \bar{A}} w_{ij} \left(\sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}}\right)^2 \\ &= \text{cut}(A, \bar{A})\left(\frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2\right) \\ &= \text{cut}(A, \bar{A})\left(\frac{|\bar{A}| + |A|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|}\right) \\ &= |V| \cdot \text{ratiocut}(A, \bar{A}) \end{align*} \]

In other words, to minimize ratiocut, it suffices to find a vector \(v^A\) to minimize \((v^A)^{\top}Lv^A\)! Therefore:

\[\min_A \text{ratiocut}(A,\bar{A}) = \min_{A \subseteq V} (v^A)^{\top}L\, v^A \]

Here we require \(v^A\) to be defined using \(A\); that is, the form of \(v\) is fixed: it must take the \(\sqrt{\frac{|\bar{A}|}{|A|}}\)-style form given above.

This is, of course, NP-hard!!

Now consider relaxing this problem. First, what conditions must \(v^A\) satisfy?

  • \(\sum_i v^A_i = 0\), namely \(\langle v^A, I_V \rangle = 0\), where \(I_V\) is the all-ones vector

  • \(\|v^A\|_2^2 = |V|\)

(Why exactly these two properties are chosen for the relaxation is, admittedly, not obvious.)

We may relax the problem:

\[\min_{v} v^{\top}L\, v \quad \text{s.t.} \quad v \perp I_V, \, \lVert v \rVert = \sqrt{n} \]

\(v\) can be any real vector!

Notice that \(v \perp I_V\), so the minimizer is the eigenvector corresponding to the second smallest eigenvalue (we search in the eigenspace orthogonal to \(I_V\)).

We may encounter a problem!!! We changed the form of the problem: how can we convert the solution of the relaxed version back into a solution of the original version?

Once we solve for \(v\), we may want to "recover" \(A\), but \(v\) is no longer defined in that way (it need not take the \(\sqrt{\frac{|\bar{A}|}{|A|}}\)-style form given above).

We can use k-means! Run 2-means on \(\mathbb{R}\) for the points \(v_1, \ldots, v_n\).

The case \(k > 2\) is almost the same, just with more components.

The relaxed problem is:

\[\min_{H \in \mathbb{R}^{n \times k}} \operatorname{tr}(H^{\top}L H), \quad \text{s.t. } H^{\top}H = I \]

SimCLR

Contrastive loss and InfoNCE loss

Given an image \(q_i\) and its augmented image \(p_i\), we want to learn a model that can tell whether \(q_i\) and \(p_i\) form a pair, i.e., whether they are neighbors in the semantic space, even though in the original pixel space they may look very different.

\[\text{infoNCE loss} = \sum_{i}L(q_i,\{p_j\}_{j=1}^{N}) = \sum_{i} - \log \frac{\exp(-\|q_i - p_i\|^2/2\tau)}{\sum_{j=1}^{N}\exp(-\|q_i - p_j\|^2/2\tau)} \]
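A direct numpy sketch of the distance-based InfoNCE loss written above, where row \(i\) of `P` is the augmented view paired with row \(i\) of `Q` (names are my own):

```python
import numpy as np

def info_nce(Q, P, tau=0.1):
    """Q, P: n x d embeddings; P[i] is the positive (augmented) pair of Q[i]."""
    d2 = ((Q[:, None, :] - P[None, :, :]) ** 2).sum(-1)   # ||q_i - p_j||^2 for all i, j
    logits = -d2 / (2 * tau)
    # log-softmax over j; the "correct class" for q_i is its own pair p_i (the diagonal)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).sum()
```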

Markov random fields

Given a graph \(\pi\), we may sample unweighted subgraphs \(W\) from \(\pi\): the edge indicators satisfy \(W_{i,j} \in \{0,1\}\), i.e., the subgraph is unweighted.

The score of \(W\): \(s(W, \pi) = \prod_{(i,j) \in [n]^2} \pi_{i,j}^{W_{i,j}}\), i.e., the product of the edge weights over the edges selected by \(W\).

Add a restriction \(\Omega(W)\) to the score: \(s(W, \pi) = \Omega(W) \prod_{(i,j) \in [n]^2} \pi_{i,j}^{W_{i,j}}\)

For example, \(\Omega(W) = 1\) if and only if each node in \(W\) has out-degree \(=1\).

Each \(W\) is sampled with probability proportional to its score. \(P(W;\pi) \propto \Omega(W) \prod_{(i,j) \in [n]^2} \pi_{i,j}^{W_{i,j}}\).

Because it is a probability, it certainly needs to be normalized.

\[\begin{align*} P(W;\pi) &= \frac{\Omega(W) \prod_{(i,j) \in [n]^2} \pi_{i,j}^{W_{i,j}}}{R(Z)}\\ R(Z) &= \sum_{W} \Omega(W) \prod_{(i,j) \in [n]^2} \pi_{i,j}^{W_{i,j}} \end{align*} \]

Define cross entropy loss \(H_\pi^k(Z)\)

\(W_X\) and \(W_Z\) are random variables (subgraphs) sampled from \(\pi\) (the original graph) and \(K_Z\) (the semantic/mapped graph), respectively.

Cross entropy loss:

\[\begin{align*} H_\pi^k(Z) &= -\sum_{i} \Pr_{W_x \sim \pi}[W_x = i]\log \Pr_{W_z \sim K_Z}[W_z = i]\\ &= -\mathbb{E}_{W_X \sim P(\cdot; \pi)}[\log P(W_Z = W_X; K_Z)] \end{align*} \]

Prove that \(H_\pi^k(Z)\) is equivalent to the InfoNCE loss

\(H_\pi^k(Z) = -\mathbb{E}_{W_X \sim P(\cdot;\pi)}[\log P(W_Z=W_X;K_Z)]\)

Because each node has only one neighbor, each node is independent.

That is, in the adjacency matrix, every row \(i\) is independent. We can rewrite the loss as:

\[H_\pi^k(Z) = -\sum_{i} \mathbb{E}_{W_X,i}[\log P(W_{Z,i}=W_{X,i};K_Z)] \]

Consider each node independently, which means we sample each node \(i\)'s neighbor \(j\) with probability proportional to \((i, j)\)'s similarity in \(\pi\).

For \(W\) with unit out-degree, \(W_i \sim \text{M}(1, \pi_{i}/\sum_{j}\pi_{i,j})\) (one multinomial draw); that is, the normalized similarity/edge weight is the probability of sampling that edge.

\(W_{X,i}\) is the \(i\)-th row of \(W_X\), with a single 1 (pointing to \(j\)) and all other entries 0 (since each node has exactly one out-edge, it is a one-hot binary vector). The same holds for \(W_{Z,i}\).

\[\begin{align*} \text{InfoNCE} &= -\sum_{i=1}^N \log\frac{\exp\left(-\frac{\|f(X_i) - f(X_{i'})\|^2}{2\tau}\right)}{\sum_{j=1}^N \exp\left(-\frac{\|f(X_i) - f(X_j)\|^2}{2\tau}\right)} \\ &\text{($X_i$ is a row of the adjacency matrix $W_X$, describing one node's edges, and $f(X_i)=Z_i$)}\\ &= -\sum_i \log\frac{ k(Z_i, Z_{i'})}{\sum_{j=1}^Nk(Z_i, Z_j)}\\ &(\text{use a Gaussian kernel as the distance/similarity function, denoted } k(\vec{u}, \vec{v}))\\ &= -\sum_i \log\frac{ k(Z_i, Z_{i'})}{\|K_{Z,i}\|_1} \end{align*} \]

Let \(Q_i = \frac{K_{Z,i}}{\|K_{Z,i}\|_1}\). \(Q_i\) can be viewed as a row vector whose \(k\)-th component is the similarity between points \(i\) and \(k\).

Of course there is a normalization factor, so that it forms a probability distribution.

\[\begin{align*} \text{InfoNCE} &= -\sum_{i=1}^N \log Q_{i,i'} \\ &= -\sum_{i=1}^N \Big(\sum_{i'=1}^{N}\Pr_{W_X \sim P(\cdot; \pi)}[(W_{X})_{(i, i')} = 1]\Big)\log Q_{i,i'} \\ & \text{(the inserted factor equals 1, so nothing changes; $(W_{X})_{(i, i')}=1$ means the edge $(i, i')$ is present in $W_X$)}\\ &= -\sum_{i=1}^N \sum_{i'=1}^{N}\Big(\Pr_{W_X \sim P(\cdot; \pi)}[(W_{X})_{(i, i')} = 1]~\log Q_{i,i'}\Big) \\ &\text{(going from the first line to this line is a bit subtle: in the first line $i'$ is fixed, while here $i'$ follows a probability distribution)}\\ &= -\sum_{i=1}^N \sum_{i'=1}^{N}\Big(\Pr[W_{X, i}=i']~\log \Pr[W_{Z, i} = i']\Big) ~~\text{(this line only changes notation)} \\ &= -\sum_i \mathbb{E}_{W_X,i}[\log P(W_{Z,i}=W_{X,i};K_Z)] \\ &= H_\pi^k(Z) \end{align*} \]

Put the probability inside.

\[\begin{align*} \log P(W_{Z,i}=W_{X,i};K_Z) &= \log \frac{\Omega(W_Z) \prod_{(i,j) \in [n]^2} (K_{Z})_{i,j}^{(W_{Z})_{i,j}}}{R(Z)}\\ &= \log \frac{\Omega(W_X) \prod_{(i,j) \in [n]^2} (K_{Z}) _{i,j}^{(W_{X})_{i,j}}}{R(Z)}~~\text{(replace some subscripts with $X$, since we condition on $W_Z = W_X$)}\\ &= \sum_{i,j}(W_{X})_{i,j}\log (K_{Z}) _{i,j} +\log \Omega(W_X) -\log R(Z)\\ &= \sum_{i,j}(W_{X})_{i,j}\log k(Z_i - Z_j) +\log \Omega(W_X) -\log R(Z) \end{align*} \]

Keep the terms that involve \(Z\), because we need to learn \(Z\) (or, equivalently, learn \(f\)).

\[\begin{align*} \arg \mathop{\min}_{Z}H_\pi^k(Z) &= \arg \mathop{\min}_Z -\mathbb{E}_{W_X\sim P(\cdot; \pi)}\Big[\sum_{i,j}(W_{X})_{i,j}\log k(Z_i - Z_j)-\log R(Z)\Big]\\ &= \arg \mathop{\min}_Z \Big(\mathbb{E}_{W_X\sim P(\cdot; \pi)} [\text{tr}(Z^{\top}L(W_X)Z)]+\log R(Z)\Big) \\ &~\text{(denote the Laplacian of $W_X$ as $L(W_X)$)}\\ \end{align*} \]

This formula is like the spectral clustering. (See homework for more detail about the last equal sign)

SNE and t-SNE

A data visualization tool that maps data into a low-dimensional space (usually 2D).

The "distance" in the original space of the data is well defined:

\[p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2_2/2\sigma_i)}{\sum_{k\ne i}\exp(-\|x_i - x_k\|^2_2/2\sigma_i)} \]

We want to map these data points into a lower dimensional space and keep their "distance"

\[\begin{align*} q_{j|i} &= \frac{\exp(-\|y_i - y_j\|^2_2)}{\sum_{k\ne i}\exp(-\|y_i - y_k\|^2_2)}\\ L &= \sum_{i}\sum_{j}p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} \end{align*} \]

We want to optimize this loss.

This is the classic SNE approach, but it still has drawbacks:

  • We have to optimize many losses: note that the loss sums over both \(i\) and \(j\), so effectively every pair contributes its own loss term.

  • It still does not solve the crowding problem.

    • First, what is the crowding problem? Using the example from the slides: in 10-dimensional space, a small ball of radius 2 occupies only \(\frac{1}{1024}\) of the volume, but when mapped into 2-dimensional space it occupies \(\frac{1}{4}\) of the volume. As a result, the points at larger distances all pile up together and cannot be separated.

To solve the first problem, t-SNE puts all the pairwise distances together and turns them into a single probability distribution, then optimizes a loss over this single distribution, which is much easier to design.

\[p_{ij} = \frac{p_{i|j}+ p_{j|i}}{2n}\\q_{ij} = \frac{d(i, j)}{\sum_{k\ne l} d(k,l)} \]

This way, \(\sum_{ij} p_{ij} = 1\) and \(\sum_{ij} q_{ij} = 1\).

To solve the second problem, the intuitive idea is:

  • Points that are close in the high-dimensional space should become even closer in the low-dimensional space.
  • Points that are far apart in the high-dimensional space should become even farther apart in the low-dimensional space.

The kernel should have a heavier tail than the Gaussian, so t-SNE uses \(\frac{1}{1+x^2}\):

\[q_{ij} = \frac{(1+\|y_i-y_j\|^2_2)^{-1}}{\sum_{k\ne l}(1+\|y_k-y_l\|^2_2)^{-1}} \]

One more question: how should \(\sigma_i\) be chosen?

\(\sigma_i\) is very important: where the data are dense, the distribution should be sharp and the variance small; where the data are sparse, the distribution should be smooth and the variance large.

We need a preset hyperparameter, the perplexity. The algorithm then finds the \(\sigma_i\) that makes the perplexity of the conditional distribution \(P_i\) equal to the user-defined value:

\[\begin{align*} perp(P_i) &= 2^{H(P_i)}\\ H(P_i) &= -\sum_{j}p_{j|i}\log_2 p_{j|i} \end{align*} \]

Clearly \(\sigma_i\) and the perplexity are related: intuitively, a larger perplexity means a larger \(H\), which means a smoother distribution and hence a larger variance.

Because the relationship between \(\sigma_i\) and the perplexity is monotonic, we can find it by binary search.
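A minimal sketch of the binary search for one \(\sigma_i\), following the \(\exp(-\|\cdot\|^2/2\sigma_i)\) parameterization used above (names are my own):

```python
import numpy as np

def find_sigma(sq_dists_i, target_perplexity, num_iters=50):
    """sq_dists_i: squared distances ||x_i - x_j||^2 to all points j != i."""
    lo, hi = 1e-10, 1e10
    for _ in range(num_iters):
        sigma = (lo + hi) / 2.0
        p = np.exp(-sq_dists_i / (2.0 * sigma))
        p /= p.sum()                             # conditional distribution p_{j|i}
        H = -(p * np.log2(p + 1e-12)).sum()      # entropy H(P_i) in bits
        if 2.0 ** H > target_perplexity:         # too smooth -> sigma is too large
            hi = sigma
        else:                                    # too sharp -> sigma is too small
            lo = sigma
    return sigma
```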