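Computing the PSI metric with NumPy
Calculation method
The population stability index (PSI) measures how much the distribution of a feature differs between two samples. It is computed as: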
\(PSI = \sum\limits_{i=1}^{n} (A_i - E_i) \ln\left(\frac{A_i}{E_i}\right)\)
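Parameter | Description |
---|---|
\(A_i\) | Proportion of actual samples that fall in bin \(i\) |
\(E_i\) | Proportion of expected samples that fall in bin \(i\); in machine learning, this is the prediction result |
\(n\) | Number of bins |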
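As a quick sanity check of the formula, the sketch below evaluates it by hand for three bins. The bin proportions are made-up illustration values, not data from this post. Each per-bin term is non-negative, and swapping the roles of \(A_i\) and \(E_i\) leaves the sum unchanged.

```python
import numpy as np

# Hypothetical bin proportions, chosen only for illustration (each sums to 1)
actual = np.array([0.2, 0.3, 0.5])
expected = np.array([0.25, 0.25, 0.5])

# Per-bin contribution (A_i - E_i) * ln(A_i / E_i)
terms = (actual - expected) * np.log(actual / expected)
psi = terms.sum()

# Same computation with the two distributions swapped
psi_swapped = np.sum((expected - actual) * np.log(expected / actual))

print(terms)             # one non-negative term per bin
print(psi, psi_swapped)  # both roughly 0.0203
```

A bin where the two proportions match contributes exactly zero, so only bins whose proportions differ add to the index.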
Implementation
import numpy as np


def calc_psi(train_proba, test_proba, n_bins=10, eps=1e-6):
    def calc_ratio(y_proba):
        # Proportion of samples in each bin; empty bins get eps so the log stays finite
        y_proba_1d = y_proba.reshape(-1)
        ratios = []
        for i, interval in enumerate(intervals):
            if i == len(intervals) - 1:
                # last bin: include probability == 1
                n_samples = np.sum((y_proba_1d >= interval[0]) & (y_proba_1d <= interval[1]))
            else:
                n_samples = np.sum((y_proba_1d >= interval[0]) & (y_proba_1d < interval[1]))
            ratio = n_samples / y_proba_1d.shape[0]
            ratios.append(eps if ratio == 0 else ratio)
        return np.array(ratios)

    # Equal-width bins covering [0, 1]
    distance = 1 / n_bins
    intervals = [(i * distance, (i + 1) * distance) for i in range(n_bins)]
    train_ratio = calc_ratio(train_proba)
    test_ratio = calc_ratio(test_proba)
    return np.sum((train_ratio - test_ratio) * np.log(train_ratio / test_ratio))
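The per-bin loop can also be written with `np.histogram`, which uses half-open bins except for the last one, which is closed on the right. The following is a minimal sketch under that assumption, not the implementation from this post: the name `calc_psi_hist` and the demo data are hypothetical, and results may differ from the loop version for values that sit exactly on a bin edge.

```python
import numpy as np


def calc_psi_hist(train_proba, test_proba, n_bins=10, eps=1e-6):
    """Hypothetical vectorized PSI over equal-width bins on [0, 1]."""
    def calc_ratio(y_proba):
        y = np.asarray(y_proba).ravel()
        # Count samples per bin; values outside [0, 1] are ignored by np.histogram
        counts, _ = np.histogram(y, bins=n_bins, range=(0.0, 1.0))
        ratio = counts / y.size
        # Replace empty bins with eps so the logarithm stays finite
        return np.where(ratio == 0, eps, ratio)

    train_ratio = calc_ratio(train_proba)
    test_ratio = calc_ratio(test_proba)
    return np.sum((train_ratio - test_ratio) * np.log(train_ratio / test_ratio))


rng = np.random.default_rng(0)  # arbitrary seed for the demo
base = rng.random(5000)         # uniform scores in [0, 1)
shifted = base ** 2             # same scores skewed toward 0

print(calc_psi_hist(base, base))     # identical samples give 0.0
print(calc_psi_hist(base, shifted))  # a shifted sample gives a clearly positive value
```

Comparing a sample with itself is a convenient smoke test: every bin proportion matches, so every term of the sum vanishes.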
Test
import numpy as np
np.random.seed(324)
probas = np.random.random(10000).reshape(-1, 1)
train_proba = probas[: 8000]
test_proba = probas[8000: ]
calc_psi(train_proba, test_proba)
# output
# 0.007639628811739914