Similarity | Pearson | Spearman | p-value | Correlation | Distance | Distance measure
These concepts are easy to confuse, and I suspect most people have not fully worked them out.
This is worth reading and very useful: Interpret the key results for Correlation
euclidean | maximum | manhattan | canberra | binary | minkowski
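Basics
First, a demonstration of correlation:
    a <- c(1, 2, 3, 4)
    b <- c(2, 4, 6, 8)
    c <- data.frame(x = a, y = b)
    plot(c)
    # t(c) puts the four samples in columns, so this call correlates samples.
    # Each pair of samples is compared over only two values (x and y),
    # which is why every entry comes out as 1.
    # cor(c) would instead correlate the two variables x and y.
    cor(t(c))

    > cor(t(c))
         [,1] [,2] [,3] [,4]
    [1,]    1    1    1    1
    [2,]    1    1    1    1
    [3,]    1    1    1    1
    [4,]    1    1    1    1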
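Preliminary conclusions:
1. Correlation measures the linear relationship between two variables;
2. If, across samples, x increases as y increases, then x and y are positively correlated; if x decreases as y increases, they are negatively correlated;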
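To make the two conclusions above concrete, here is a minimal sketch that correlates the variables directly. It reuses the vectors `a` and `b` from the demonstration; the names `r_pos`, `r_neg` and `r_manual` are introduced here for illustration only.

```r
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 8)

# Correlation between two variables measured on the same four samples
r_pos <- cor(a, b)    # b rises as a rises
r_neg <- cor(a, -b)   # -b falls as a rises

# Pearson's r from its definition: covariance scaled by both standard deviations
r_manual <- cov(a, b) / (sd(a) * sd(b))

print(c(positive = r_pos, negative = r_neg, manual = r_manual))
```

The sign of the coefficient gives the direction of the relationship, and the manual value should agree with `cor(a, b)` up to floating-point error.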
Next, compute distances:
    > dist(c, method = "euclidean")
             1        2        3
    2 2.236068
    3 4.472136 2.236068
    4 6.708204 4.472136 2.236068
    > sqrt(2^2+1^2)
    [1] 2.236068
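Preliminary conclusions:
1. A distance is computed between two points in a given coordinate system;
2. Distance can be used to express similarity; for example, samples 1 and 2 are more similar than samples 1 and 4;
3. Euclidean distance is the ordinary geometric distance we are most used to, so it is fairly intuitive;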
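The points above can be checked by hand. This is a small sketch under the same toy data; `pts` and the helper `euclid` are hypothetical names that do not appear in the original post.

```r
pts <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# Hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(p, q) sqrt(sum((p - q)^2))

# Full pairwise distance matrix from the built-in function
d_builtin <- as.matrix(dist(pts, method = "euclidean"))

# The same distances computed directly from the coordinates
d_12 <- euclid(unlist(pts[1, ]), unlist(pts[2, ]))
d_14 <- euclid(unlist(pts[1, ]), unlist(pts[4, ]))

print(c(d_12 = d_12, d_14 = d_14))
print(d_12 < d_14)  # a smaller distance means the two samples are more similar
```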
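So when should we compute correlation, and when similarity?
Gene expression analysis calls for correlation: co-expression is all about correlation, meaning that genes A and B rise and fall together across samples. Correlation is therefore a statement about variables. I have not yet come across correlation being computed between samples in this setting; it seems less meaningful to me, since one would hardly say that certain samples' expression rises and falls together across these genes.
For samples, the more common task is measuring similarity: to find out whether sample A is more similar to sample B or to sample C, compute a distance between them in the shared gene coordinates.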
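The distinction above can be sketched on a toy expression matrix. All values and names here (`expr`, `geneA` to `geneC`, `s1` to `s4`) are made up for illustration and are not data from the post.

```r
# Hypothetical toy expression matrix: rows are genes, columns are samples
expr <- matrix(
  c(1, 2, 3, 4,    # geneA
    2, 4, 6, 8,    # geneB: rises and falls together with geneA
    9, 7, 4, 1),   # geneC: moves in the opposite direction
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3", "s4"))
)

# cor() correlates columns, so transpose to put the genes in columns
gene_cor <- cor(t(expr))

# dist() measures distances between rows, so transpose to put the samples in rows
sample_dist <- as.matrix(dist(t(expr), method = "euclidean"))

print(round(gene_cor, 3))
print(round(sample_dist, 3))
```

Reading `gene_cor` answers the co-expression question (which genes move together), while reading `sample_dist` answers the similarity question (which samples sit closest in gene space).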
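Advanced
1. How do the different correlation methods differ?
2. How do the different distance calculations differ?
3. What are the limitations of correlation analysis?
A brief look at the difference between Pearson and Spearman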
    x <- 1:100
    y <- exp(x)
    cor(x, y, method = "pearson")   # 0.25
    cor(x, y, method = "spearman")  # 1
    plot(x, y)
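Conclusion:
Pearson assesses linear correlation,
whereas Spearman can also capture nonlinear relationships, as long as they are monotonic. More precisely, it measures the monotonic relationship between the two variables.
Reference: Correlation (Pearson, Kendall, Spearman)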
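One way to see why Spearman reports 1 for the exponential example is that it is Pearson's coefficient applied to ranks. The sketch below checks this on the same `x` and `y`; the names `rho`, `rho_from_ranks` and `r_log` are introduced here for illustration.

```r
x <- 1:100
y <- exp(x)

# Spearman's rho as returned by cor()
rho <- cor(x, y, method = "spearman")

# The same quantity obtained as Pearson's r on the ranks of x and y
rho_from_ranks <- cor(rank(x), rank(y), method = "pearson")

# Taking logs makes the relationship linear, so Pearson recovers it as well
r_log <- cor(x, log(y), method = "pearson")

print(c(spearman = rho, pearson_on_ranks = rho_from_ranks, pearson_on_log = r_log))
```

Because ranks are unchanged by any strictly increasing transformation, Spearman cannot tell `y` from `log(y)`, whereas Pearson can.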
Effect of outliers on correlation
    x <- 1:100
    y <- 1:100
    cor(x, y, method = "pearson")   # 1
    y[100] <- 1000
    cor(x, y, method = "pearson")   # 0.448793
    cor(x, y, method = "spearman")  # 1
    y[99] <- 0
    cor(x, y, method = "spearman")  # 0.9417822
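Conclusion:
A single outlier has a very large effect on Pearson, but a much smaller effect on Spearman.
Pearson's correlation coefficient measures how strongly two variables are linearly related, but its magnitude alone does not fully reflect the true relationship between them.
The correlation coefficient is defined only when the standard deviations of both variables are non-zero.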
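Both caveats above can be illustrated with a short sketch. The parabola and the constant vector are made-up examples, not data from the post, and the names `r_parabola`, `flat` and `r_flat` are hypothetical.

```r
# A perfectly deterministic but non-linear, non-monotonic relationship
x <- -10:10
y <- x^2
r_parabola <- cor(x, y)   # essentially 0, although y is fully determined by x

# A variable with zero standard deviation: the coefficient is undefined
flat <- rep(5, length(x))
r_flat <- suppressWarnings(cor(x, flat))   # R warns and returns NA

print(r_parabola)
print(is.na(r_flat))
```

A coefficient near zero therefore rules out a linear trend only, not dependence in general, which is why plotting the data alongside `cor()` is a sensible habit.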
Differences among the three correlation methods: Pearson, Kendall, Spearman
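The post does not run Kendall itself, so the following is only a sketch of how the three coefficients could be compared side by side, reusing the outlier data from above. It relies on base R's `cor()` accepting `method = "kendall"`; the names `methods` and `coefs` are introduced here.

```r
x <- 1:100
y <- 1:100
y[100] <- 1000   # one extreme value
y[99] <- 0       # one value pushed to the bottom of the ranking

# Compute all three coefficients on the same data
methods <- c("pearson", "kendall", "spearman")
coefs <- sapply(methods, function(m) cor(x, y, method = m))

print(round(coefs, 4))
```

Kendall's tau counts concordant versus discordant pairs, so like Spearman it depends only on the ordering of the values and not on how extreme the outlier is.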
distance measure
euclidean | maximum | manhattan | canberra | binary | minkowski
k-NN 4: which distance function?
distance measure euclidean
euclidean:
Usual distance between the two vectors (2 norm aka L_2), sqrt(sum((x_i - y_i)^2)).
maximum:
Maximum distance between two components of x and y (supremum norm)
manhattan:
Absolute distance between the two vectors (1 norm aka L_1).
canberra:
sum(|x_i - y_i| / (|x_i| + |y_i|)). Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
This is intended for non-negative values (e.g., counts), in which case the denominator can be written in various equivalent ways; Originally, R used x_i + y_i, then from 1998 to 2017, |x_i + y_i|, and then the correct |x_i| + |y_i|.
binary:
(aka asymmetric binary): The vectors are regarded as binary bits, so non-zero elements are ‘on’ and zero elements are ‘off’. The distance is the proportion of bits in which only one is on amongst those in which at least one is on.
minkowski:
The p norm, the pth root of the sum of the pth powers of the differences of the components.
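The definitions above can be checked against `dist()` by computing each measure by hand. This is a sketch on two made-up vectors chosen so that no position is zero in both (which avoids the omitted-term case described for canberra); `p = 3` for minkowski is an arbitrary choice for illustration.

```r
x <- c(1, 0, 3, 5)
y <- c(4, 2, 0, 1)
m <- rbind(x, y)   # dist() computes distances between rows

# Each measure written out from its definition
by_hand <- c(
  euclidean = sqrt(sum((x - y)^2)),
  maximum   = max(abs(x - y)),
  manhattan = sum(abs(x - y)),
  canberra  = sum(abs(x - y) / (abs(x) + abs(y))),
  binary    = sum(xor(x != 0, y != 0)) / sum(x != 0 | y != 0),
  minkowski = sum(abs(x - y)^3)^(1 / 3)
)

# The same measures from the built-in function
methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary")
built_in <- vapply(
  methods,
  function(meth) as.numeric(dist(m, method = meth)),
  numeric(1)
)
built_in["minkowski"] <- as.numeric(dist(m, method = "minkowski", p = 3))

print(rbind(by_hand, built_in))
```

Note that minkowski with `p = 1` reduces to manhattan and with `p = 2` to euclidean, so the choice of `p` controls how strongly large coordinate differences dominate.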
To be continued~