pandas vs. Spark: speed comparison for computing correlation coefficients
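There are three common correlation measures: Pearson, Spearman, and Kendall.
In pandas, all three can be computed directly on a DataFrame with data.corr(); the method argument selects which coefficient is used.
Under the hood, the computation relies on scipy's algorithms, at least for some methods.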
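As a small illustration of the call described above, the following sketch runs `corr` with each method on a tiny random DataFrame. The column names, seed, and sizes are made up for this example and are not the benchmark settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 rows, 4 columns (not the benchmark sizes).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("abcd"))

# Each call returns a square DataFrame of pairwise coefficients.
results = {m: toy.corr(method=m) for m in ("pearson", "spearman", "kendall")}

for name, mat in results.items():
    print(name, mat.shape)
```

Each result is a symmetric columns-by-columns matrix, which is why the cost grows quickly with the number of columns.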
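To speed up the computation, it was moved to Spark.
Three approaches were compared: plain pandas, scipy functions run in parallel on Spark, and the Spark MLlib library.
Overall, Spark MLlib was by far the fastest. Spark + scipy beat pandas for Pearson and Kendall but was slower than pandas for Spearman; apart from that case, pandas was the slowest.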
corr execution speed results
All times are in seconds.
Data size (rows*columns) | corr method | pandas | Spark + scipy | Spark MLlib | Notes |
---|---|---|---|---|---|
1000*3600 | pearsonr | 203 | 170 | 37 | pyspark |
1000*3600 | pearsonr | 203 | 50 | not run | Spark + scipy computed only half |
1000*3600 | pearsonr | 203 | 125 | 37 | client mode |
1000*3600 | pearsonr | 202 | 157 | 38 | client mode |
1000*3600 | spearmanr | 1386 | 6418 | 37 | client mode |
1000*3600 | spearmanr | 1327 | 6392 | 38 | client mode |
1000*3600 | kendall | 4326 | 398 | not supported | client mode |
1000*3600 | kendall | 4239 | 346 | not supported | client mode |
1000*1000 | spearmanr | 127 | 294 | 12 | client mode |
1000*1000 | spearmanr | 98 | 513 | 5.55 | client mode |
1000*360 | spearmanr | 13 | 150 | not run | 160 s with a plain list comprehension: res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)] |
1000*360 | kendall | 40 | 45 | not supported | 116 s with a plain list comprehension: res = [st.kendalltau(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)] |
Note: spearmanr is unexpectedly slow with the Spark + scipy combination. This needs further comparison and analysis; something looks wrong there.
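One possible contributor to this slowdown (a guess, not something the measurements above confirm) is that `st.spearmanr` re-ranks both columns on every pairwise call. The Spearman coefficient is the Pearson coefficient of the ranked data, so each column only needs to be ranked once. A minimal sketch on a small made-up data set:

```python
import numpy as np
import pandas as pd
from scipy import stats as st

# Hypothetical small data set (100 rows, 6 columns), not the benchmark sizes.
rng = np.random.default_rng(1)
small = pd.DataFrame(rng.standard_normal((100, 6)))

# Rank every column once, then take the Pearson correlation of the ranks.
ranks = small.rank(axis=0).to_numpy()
spearman_fast = np.corrcoef(ranks, rowvar=False)

# Reference: pairwise scipy calls, which rank both inputs on each call.
n_cols = small.shape[1]
spearman_pairwise = np.array(
    [[st.spearmanr(small.iloc[:, i], small.iloc[:, j])[0] for j in range(n_cols)]
     for i in range(n_cols)]
)

# Compare the two matrices element by element.
print(np.allclose(spearman_fast, spearman_pairwise))
```

If the two matrices agree, the rank-once variant is a drop-in replacement for the pairwise loop and avoids repeating the sort for every pair.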
The scripts for the three approaches are as follows:
pandas script
import numpy as np
import pandas as pd
import time
C = 1000
N = 3600
data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))
print("============================ {}".format(data.shape))
print("start pandas corr ---{} ".format(time.time()))
start = time.time()
# {'pearson', 'kendall', 'spearman'}
res = data.corr(method='pearson')
end_1 = time.time()
res = data.corr(method='spearman')
end_2 = time.time()
res = data.corr(method='kendall')
end_3 = time.time()
print("pandas pearson count {} total cost : {}".format(len(res), end_1 - start))
print("pandas spearman count {} total cost : {}".format(len(res), end_2 - end_1))
print("pandas kendall count {} total cost : {}".format(len(res), end_3 - end_2))
Spark + scipy script
from pyspark import SparkContext
sc = SparkContext()
import numpy as np
import pandas as pd
from scipy import stats as st
import time
# t1 = st.kendalltau(x, y)
# t2 = st.spearmanr(x, y)
# t3 = st.pearsonr(x, y)
C = 1000
N = 3600
data = pd.DataFrame(np.random.randn(C * N).reshape(C, -1))
# Each function returns one row of the correlation matrix: column n against every column.
def pearsonr(n):
    x = data.iloc[:, n]
    res = [st.pearsonr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
    return res
def spearmanr(n):
    x = data.iloc[:, n]
    res = [st.spearmanr(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
    return res
def kendalltau(n):
    x = data.iloc[:, n]
    res = [st.kendalltau(x, data.iloc[:, i])[0] for i in range(data.shape[1])]
    return res
start = time.time()
res = sc.parallelize(np.arange(N)).map(lambda x: pearsonr(x)).collect()
# res = sc.parallelize(np.arange(N)).map(lambda x: spearmanr(x)).collect()
# res = sc.parallelize(np.arange(N)).map(lambda x: kendalltau(x)).collect()
end = time.time()
# Only the method left uncommented above was timed, so print only its label.
print("pearsonr count {} total cost : {}".format(len(res), end - start))
# print("spearmanr count {} total cost : {}".format(len(res), end - start))
# print("kendalltau count {} total cost : {}".format(len(res), end - start))
# Pure Python baseline: a list comprehension over every column pair
s = time.time()
res = [st.spearmanr(data.iloc[:, i], data.iloc[:, j])[0] for i in range(N) for j in range(N)]
end = time.time()
print(end - s)
# Same pairwise computation on Spark: distribute (i, j) index pairs, not the results above
pairs = [(i, j) for i in range(N) for j in range(N)]
start = time.time()
dd = sc.parallelize(pairs).map(lambda x: st.spearmanr(data.iloc[:, x[0]], data.iloc[:, x[1]])[0]).collect()
end = time.time()
print(end - start)
start = time.time()
dd = sc.parallelize(pairs).map(lambda x: st.kendalltau(data.iloc[:, x[0]], data.iloc[:, x[1]])[0]).collect()
end = time.time()
print(end - start)
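Because a correlation matrix is symmetric with ones on the diagonal, only the pairs with i < j need to be computed, which is presumably what the "computed only half" row in the results table refers to. The sketch below builds that pair list and mirrors the results into a full matrix. `corr_pair` and `upper_pairs` are hypothetical names, the data is made up, and the plain `map` stands in for `sc.parallelize(upper_pairs).map(corr_pair).collect()` so the example runs without a cluster.

```python
from itertools import combinations

import numpy as np
from scipy import stats as st

# Hypothetical small data set; the benchmark used 1000 x 3600.
rng = np.random.default_rng(2)
values = rng.standard_normal((200, 8))
n_cols = values.shape[1]

def corr_pair(pair):
    """Return (i, j, Kendall tau) for one column pair."""
    i, j = pair
    return i, j, st.kendalltau(values[:, i], values[:, j])[0]

# Only pairs with i < j: n * (n - 1) / 2 calls instead of n * n.
upper_pairs = list(combinations(range(n_cols), 2))

# Local stand-in for the Spark map over the pair list.
triples = list(map(corr_pair, upper_pairs))

# Mirror the upper triangle into the lower one; the diagonal stays 1.
matrix = np.eye(n_cols)
for i, j, tau in triples:
    matrix[i, j] = matrix[j, i] = tau
```

This roughly halves the number of scipy calls compared with mapping over every (i, j) combination.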
Spark MLlib script
from pyspark import SparkContext
sc = SparkContext()
from pyspark.mllib.stat import Statistics
import time
import numpy as np
L = 1000
N = 3600
# L rows, each a length-N vector; Statistics.corr returns an N x N matrix over the columns
t = [np.random.randn(N) for i in range(L)]
data = sc.parallelize(t)
start = time.time()
res = Statistics.corr(data, method="pearson") # spearman pearson
end = time.time()
print("pearson : ", end-start)
start = time.time()
res = Statistics.corr(data, method="spearman") # spearman pearson
end = time.time()
print("spearman: ", end-start)