Computing the PSI Metric with NumPy

Calculation method
The population stability index (PSI) measures how much the distribution of a feature differs between two samples. It is computed as:

\(PSI = \sum\limits_{i=1}^{n} (A_i - E_i) \ln\left(\frac{A_i}{E_i}\right)\)

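Parameter  Description
\(A_i\)  Share of Actual samples that fall in bin \(i\)
\(E_i\)  Share of Expected samples that fall in bin \(i\); in machine learning this is usually the distribution of the model's predictions on the reference (e.g. training) sample
\(n\)  Number of bins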
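
As a quick sanity check of the formula, here is a minimal sketch that evaluates it on made-up bin shares. The `actual` and `expected` arrays are illustrative values, not taken from any dataset. It also shows two properties that follow from the formula: every per-bin term is non-negative, and swapping \(A\) and \(E\) leaves the result unchanged.

```python
import numpy as np

# Hypothetical bin shares for a feature split into 4 bins (each sums to 1).
actual = np.array([0.10, 0.20, 0.30, 0.40])    # A_i
expected = np.array([0.25, 0.25, 0.25, 0.25])  # E_i

# Per-bin contribution (A_i - E_i) * ln(A_i / E_i).
# Both factors always share the same sign, so each term is >= 0.
terms = (actual - expected) * np.log(actual / expected)
psi = terms.sum()

# Swapping the roles of A and E flips both factors, so PSI is symmetric.
psi_swapped = np.sum((expected - actual) * np.log(expected / actual))

print(terms.round(4))
print(round(psi, 4), round(psi_swapped, 4))  # both about 0.2282
```

The first bin contributes the most here because its share moves furthest from the expected 0.25 in relative terms.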

Implementation

import numpy as np


def calc_psi(train_proba, test_proba, n_bins=10, eps=1e-6):
    """Compute PSI between two arrays of probabilities in [0, 1]."""

    def calc_ratio(y_proba):
        # Flatten so that (n, 1) and (n,) inputs are handled alike.
        y_proba_1d = np.asarray(y_proba).ravel()
        ratios = []
        for i, (low, high) in enumerate(intervals):
            if i == len(intervals) - 1:
                # The last bin is closed on the right so that probability == 1 is counted.
                in_bin = (y_proba_1d >= low) & (y_proba_1d <= high)
            else:
                in_bin = (y_proba_1d >= low) & (y_proba_1d < high)
            ratio = np.count_nonzero(in_bin) / y_proba_1d.size
            # Replace empty bins with eps to avoid log(0) and division by zero.
            ratios.append(ratio if ratio > 0 else eps)
        return np.array(ratios)

    # Equal-width bins covering [0, 1].
    distance = 1 / n_bins
    intervals = [(i * distance, (i + 1) * distance) for i in range(n_bins)]
    train_ratio = calc_ratio(train_proba)
    test_ratio = calc_ratio(test_proba)
    return np.sum((train_ratio - test_ratio) * np.log(train_ratio / test_ratio))
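
The per-bin loop can also be written with `np.histogram`, which closes the last bin on the right and therefore counts a probability of exactly 1. The following is only a sketch of that alternative: `psi_hist` is a hypothetical name, it assumes all scores lie in [0, 1], and its bin edges come from `np.linspace`, so counts may differ from the loop version for values that sit exactly on an edge.

```python
import numpy as np


def psi_hist(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    """Hypothetical helper: PSI over equal-width bins on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    def ratios(scores):
        scores = np.ravel(scores)
        # np.histogram treats every bin as [low, high) except the last, which is [low, high].
        counts, _ = np.histogram(scores, bins=edges)
        # Replace empty bins with eps so the logarithm stays finite.
        return np.where(counts == 0, eps, counts / scores.size)

    e, a = ratios(expected_scores), ratios(actual_scores)
    return float(np.sum((a - e) * np.log(a / e)))


rng = np.random.default_rng(0)
# Two samples from the same uniform distribution: PSI should be close to 0.
same = psi_hist(rng.random(8000), rng.random(2000))
# Uniform reference vs. scores skewed toward 0 (Beta(2, 5)): PSI should be much larger.
shifted = psi_hist(rng.random(8000), rng.beta(2, 5, size=2000))
print(same, shifted)
```

Comparing the two printed values gives a feel for the scale of the metric: samples drawn from the same distribution stay near zero, while a clearly different distribution produces a much larger value.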

Test

import numpy as np
np.random.seed(324)

probas = np.random.random(10000).reshape(-1, 1)
train_proba = probas[: 8000]
test_proba = probas[8000: ]
calc_psi(train_proba, test_proba)
# output
# 0.007639628811739914
posted @ 2023-04-19 19:04  oaksharks