excel做回归分析的应用【风控数据分析】
方法1
统计逻辑:统计一个loginname的所有去重的通讯录数C,统计这个Loginname对应的每个设备对应的通讯录c1,c2,c3…cn;
X=(c1/c+c2/c+c3/c+….cn/c)/n=(c1+c2+…cn)/(c*n)
数据表现如下:
将3个以上设备计算出来的结果关联上黑名单库,在8092个loginname中,有3915个,
占48.4%。将是否黑名单当作Y值,X值为自变量进行回归:
Y与X之间的相关性为0.1229,显著性正相关。
回归统计 |
|
|
|
|
|
|
|
|
Multiple R |
0.12294864 |
|
|
|
|
|
|
|
R Square |
0.01511637 |
|
|
|
|
|
|
|
Adjusted R Square |
0.01499461 |
|
|
|
|
|
|
|
标准误差 |
0.49600959 |
|
|
|
|
|
|
|
观测值 |
8091 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回归分析 |
1 |
30.54481 |
30.54481 |
124.153 |
1.26E-28 |
|
|
|
残差 |
8089 |
1990.1 |
0.246026 |
|
|
|
|
|
总计 |
8090 |
2020.645 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.65722473 |
0.016506 |
39.81648 |
0 |
0.624868 |
0.689581 |
0.624868 |
0.689581 |
X Variable 1 |
-0.2788014 |
0.025022 |
-11.1424 |
1.26E-28 |
-0.32785 |
-0.22975 |
-0.32785 |
-0.22975 |
方法2
统计逻辑:计算出每2个设备之间通讯录的相同率,如Xab,Xac,Xad,Xcd…XnXn-1,关联上黑名单库,在当时7383的大于等于3个设备的loginname中,按照设备个数的不同分组,然后进行回归,
3个设备的loginname有6512个,其中黑名单3294个,占比50.6%,X与Y的关系程正弱相关,前2个之间的相同率(自变量)对Y的影响最大。
Multiple R |
0.119978 |
|
|
|
|
|
|
|
R Square |
0.014395 |
|
|
|
|
|
|
|
Adjusted R Square |
0.01394 |
|
|
|
|
|
|
|
标准误差 |
0.496507 |
|
|
|
|
|
|
|
观测值 |
6512 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回归分析 |
3 |
23.43156 |
7.810518 |
31.68321 |
2.52E-20 |
|
|
|
残差 |
6508 |
1604.347 |
0.246519 |
|
|
|
|
|
总计 |
6511 |
1627.778 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.40805 |
0.011882 |
34.34126 |
1.1E-237 |
0.384757 |
0.431344 |
0.384757 |
0.431344 |
X Variable 1 |
0.115797 |
0.039147 |
2.958019 |
0.003107 |
0.039056 |
0.192537 |
0.039056 |
0.192537 |
X Variable 2 |
0.163985 |
0.039448 |
4.156987 |
3.27E-05 |
0.086654 |
0.241316 |
0.086654 |
0.241316 |
X Variable 3 |
0.087144 |
0.038996 |
2.234676 |
0.025472 |
0.010699 |
0.16359 |
0.010699 |
0.16359 |
4个设备的组,统计回归相对高0.05个百分点的相关性,但是相关性依然较弱,第一个和第6个变量起到决定性影响,5个以上的设备存在高度正相关,但是因为数据量太少,
不认为具有统计学代表意义。
回归统计 |
|
|
|
|
|
|
|
|
|
|
Multiple R |
0.171505 |
|
|
|
|
|
|
|
|
|
R Square |
0.029414 |
|
|
|
|
|
|
|
|
|
Adjusted R Square |
0.020577 |
|
|
|
|
|
|
|
|
|
标准误差 |
0.493803 |
|
|
|
|
|
|
|
|
|
观测值 |
666 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
|
F |
Significance F |
|
|
|
|
回归分析 |
6 |
4.869808 |
0.811635 |
|
3.32853 |
0.003088 |
|
|
|
|
残差 |
659 |
160.6918 |
0.243842 |
|
|
|
|
|
|
|
总计 |
665 |
165.5616 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
|
P-value |
Lower 95% |
Upper 95% |
|
下限 95.0% |
上限 95.0% |
Intercept |
0.355842 |
0.071268 |
4.993029 |
|
7.62E-07 |
0.215903 |
0.495781 |
|
0.215903 |
0.495781 |
X Variable 1 |
0.209455 |
0.131209 |
1.596339 |
|
0.110893 |
-0.04818 |
0.467094 |
|
-0.04818 |
0.467094 |
X Variable 2 |
0.089427 |
0.136561 |
0.654849 |
|
0.512793 |
-0.17872 |
0.357575 |
|
-0.17872 |
0.357575 |
X Variable 3 |
0.09917 |
0.127629 |
0.777021 |
|
0.437425 |
-0.15144 |
0.349778 |
|
-0.15144 |
0.349778 |
X Variable 4 |
-0.10775 |
0.144002 |
-0.74823 |
|
0.454591 |
-0.3905 |
0.175012 |
|
-0.3905 |
0.175012 |
X Variable 5 |
-0.14237 |
0.121381 |
-1.17293 |
|
0.241249 |
-0.38071 |
0.095969 |
|
-0.38071 |
0.095969 |
方法3
没能获取用户更换设备的时间,分析业务逻辑,可以通过手机号登陆APP时候对应的版本号来间接代表更换设备的时间。成为自变量的X有:设备数,更换设备的次数,每2个设备更换的最大时间差,最小时间差,平均更换时间,更换设备时间差的标准差,累计更换时间,进行回归得到如下结果:设备累计更换时间和更换时间差的标准差成显著反比。
回归统计 |
|
||||||||
Multiple R |
0.154438366 |
|
|||||||
R Square |
0.023851209 |
|
|||||||
Adjusted R Square |
0.023259245 |
|
|||||||
标准误差 |
0.492116372 |
|
|||||||
观测值 |
9901 |
|
|||||||
|
|||||||||
方差分析 |
|
||||||||
|
df |
SS |
MS |
|
F |
Significance F |
|||
回归分析 |
6 |
58.54663 |
9.757771812 |
|
40.29164794 |
9.68003E-49 |
|||
残差 |
9894 |
2396.114 |
0.242178524 |
|
|||||
总计 |
9900 |
2454.661 |
|
|
|
|
|||
|
|||||||||
|
Coefficients |
标准误差 |
t Stat |
|
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.448879765 |
0.032716 |
13.72029481 |
|
1.87579E-42 |
0.384748795 |
0.513010734 |
0.384749 |
0.513011 |
X Variable 1 |
0.054091353 |
0.010083 |
5.364858051 |
|
8.28418E-08 |
0.034327536 |
0.073855171 |
0.034328 |
0.073855 |
X Variable 2 |
-0.014132904 |
0.001316 |
-10.74041732 |
|
9.23172E-27 |
-0.016712261 |
-0.011553547 |
-0.01671 |
-0.01155 |
X Variable 3 |
0.000177389 |
3.03E-05 |
5.85593915 |
|
4.8943E-09 |
0.00011801 |
0.000236767 |
0.000118 |
0.000237 |
X Variable 4 |
1.01037E-08 |
1.25E-09 |
8.067239472 |
|
8.02511E-16 |
7.64868E-09 |
1.25587E-08 |
7.65E-09 |
1.26E-08 |
X Variable 5 |
-7.0701E-09 |
1.23E-09 |
-5.741173492 |
|
9.67925E-09 |
-9.48403E-09 |
-4.65616E-09 |
-9.5E-09 |
-4.7E-09 |
X Variable 6 |
-3.40952E-08 |
3.46E-09 |
-9.855884854 |
|
8.23224E-23 |
-4.08763E-08 |
-2.73141E-08 |
-4.1E-08 |
-2.7E-08 |
方法4
回归统计 |
|
|
|
|
|
|
|
|
Multiple R |
0.211204 |
|
|
|
|
|
|
|
R Square |
0.044607 |
|
|
|
|
|
|
|
Adjusted R Square |
0.043425 |
|
|
|
|
|
|
|
标准误差 |
0.488799 |
|
|
|
|
|
|
|
观测值 |
8091 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回归分析 |
10 |
90.13552 |
9.013552 |
37.72553 |
3.87E-73 |
|
|
|
残差 |
8080 |
1930.51 |
0.238924 |
|
|
|
|
|
总计 |
8090 |
2020.645 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.824792 |
0.042672 |
19.32859 |
2.08E-81 |
0.741144 |
0.90844 |
0.741144 |
0.90844039 |
X Variable 1 |
0.000604 |
0.01188 |
0.050819 |
0.959471 |
-0.02269 |
0.023893 |
-0.02269 |
0.02389257 |
X Variable 2 |
-0.01201 |
0.001585 |
-7.57658 |
3.94E-14 |
-0.01512 |
-0.0089 |
-0.01512 |
-0.0089046 |
X Variable 3 |
0.000106 |
3.51E-05 |
3.01029 |
0.002618 |
3.68E-05 |
0.000174 |
3.68E-05 |
0.0001743 |
X Variable 4 |
1.11E-08 |
2.04E-09 |
5.45342 |
5.09E-08 |
7.12E-09 |
1.51E-08 |
7.12E-09 |
1.5115E-08 |
X Variable 5 |
-7.5E-09 |
1.78E-09 |
-4.239 |
2.27E-05 |
-1.1E-08 |
-4E-09 |
-1.1E-08 |
-4.05E-09 |
X Variable 6 |
4.7E-10 |
2.6E-09 |
0.181014 |
0.856361 |
-4.6E-09 |
5.56E-09 |
-4.6E-09 |
5.5622E-09 |
X Variable 7 |
-2.2E-08 |
4.59E-09 |
-4.85498 |
1.23E-06 |
-3.1E-08 |
-1.3E-08 |
-3.1E-08 |
-1.329E-08 |
X Variable 8 |
2.45E-11 |
1.14E-11 |
2.149089 |
0.031657 |
2.15E-12 |
4.68E-11 |
2.15E-12 |
4.68E-11 |
X Variable 9 |
-0.00266 |
0.000444 |
-5.99087 |
2.18E-09 |
-0.00353 |
-0.00179 |
-0.00353 |
-0.0017881 |
X Variable 10 |
-0.24645 |
0.02494 |
-9.88193 |
6.72E-23 |
-0.29534 |
-0.19757 |
-0.29534 |
-0.1975655 |
单个X对Y都有一定影响,现在将设备更换时间的频率、更换频率标准偏差、每个设备的平均更换周期、设备相同率几个维度组合成Xn变量回归,结果显示X对Y的总体复相关性已经有较大幅度增加,达到0.211204。
虽然这些变量对判定是否黑名单的决定性怡然较小,需要挖掘更多的用户特征进行丰富。
dc |
dcs |
abrankcount |
maxtime |
mintime |
avgtime |
stddevt |
totaltime |
interval |
pc |
3 |
3 |
3 |
3237421 |
-4064 |
2155571.333 |
1527094 |
6466714 |
13 |
0.457187 |
3 |
3 |
3 |
20561466 |
9930757 |
13707644 |
4854801 |
41122932 |
80 |
0.57037 |
3 |
3 |
3 |
18509989 |
-19642609 |
-755080 |
15578021 |
2265240 |
76 |
0.343496 |
3 |
3 |
3 |
-14023 |
-1893173 |
-1262115.333 |
882553.1 |
3786346 |
8 |
0.335878 |
方法5
回归统计 |
|
|
|
|
|
|
|
|
Multiple R |
0.506696 |
|
|
|
|
|
|
|
R Square |
0.256741 |
|
|
|
|
|
|
|
Adjusted R Square |
0.256666 |
|
|
|
|
|
|
|
标准误差 |
0.429309 |
|
|
|
|
|
|
|
观测值 |
9901 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回归分析 |
1 |
630.2114 |
630.2114 |
3419.367 |
0 |
|
|
|
残差 |
9899 |
1824.45 |
0.184306 |
|
|
|
|
|
总计 |
9900 |
2454.661 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.760723 |
0.006787 |
112.093 |
0 |
0.74742 |
0.774026 |
0.74742 |
0.774026 |
X Variable 1 |
-0.19139 |
0.003273 |
-58.4754 |
0 |
-0.1978 |
-0.18497 |
-0.1978 |
-0.18497 |
求出每个手机更换的设备中是IOS的数量,回归出来的结果显示X单变量对Y的影响是显著正相关,达到0.506696
方法6
将以上所有变量加入,形成的特征如下:
ifblack |
dc |
dcs |
abrankcount |
maxtime |
mintime |
avgtime |
stddevt |
totaltime |
interval |
pc |
deiosnum |
1 |
3 |
3 |
3 |
3237421 |
-4064 |
2155571.333 |
1527094 |
6466714 |
13 |
0.457 |
2 |
0 |
3 |
3 |
3 |
20561466 |
9930757 |
13707644 |
4854801 |
41122932 |
80 |
0.57 |
2 |
1 |
3 |
3 |
3 |
18509989 |
-19642609 |
-755080 |
15578021 |
2265240 |
76 |
0.343 |
0 |
0 |
3 |
3 |
3 |
-14023 |
-1893173 |
-1262115.333 |
882553.1 |
3786346 |
8 |
0.336 |
1 |
1 |
3 |
3 |
3 |
-91956 |
-724724 |
-483149.3333 |
279151.3 |
1449448 |
3 |
0.944 |
1 |
1 |
3 |
3 |
3 |
412948 |
-845102 |
-288102.6667 |
523600.1 |
864308 |
4 |
0.344 |
0 |
0 |
3 |
3 |
3 |
5092634 |
1290609 |
3395089.333 |
1578617 |
10185268 |
20 |
0.335 |
1 |
回归分析相关复系数达到0.547115,较高的因果关系
回归统计 |
|
|
|
|
|
|
|
|
Multiple R |
0.547115 |
|
|
|
|
|
|
|
R Square |
0.299335 |
|
|
|
|
|
|
|
Adjusted R Square |
0.298556 |
|
|
|
|
|
|
|
标准误差 |
0.417037 |
|
|
|
|
|
|
|
观测值 |
9901 |
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回归分析 |
11 |
734.7663 |
66.79693 |
384.067 |
0 |
|
|
|
残差 |
9889 |
1719.895 |
0.17392 |
|
|
|
|
|
总计 |
9900 |
2454.661 |
|
C |
|
|
|
|
|
Coefficients |
标准误差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.173297 |
0.033632 |
5.152794 |
2.62E-07 |
0.107372 |
0.239222 |
0.107372 |
0.239222 |
X Variable 1 |
0.190591 |
0.010059 |
18.94818 |
1.12E-78 |
0.170874 |
0.210307 |
0.170874 |
0.210307 |
X Variable 2 |
-0.00304 |
0.001233 |
-2.46576 |
0.013689 |
-0.00546 |
-0.00062 |
-0.00546 |
-0.00062 |
X Variable 3 |
1.19E-05 |
2.81E-05 |
0.423152 |
0.672194 |
-4.3E-05 |
6.7E-05 |
-4.3E-05 |
6.7E-05 |
X Variable 4 |
5.53E-09 |
1.51E-09 |
3.651315 |
0.000262 |
2.56E-09 |
8.49E-09 |
2.56E-09 |
8.49E-09 |
X Variable 5 |
-6.4E-09 |
1.32E-09 |
-4.80573 |
1.56E-06 |
-8.9E-09 |
-3.8E-09 |
-8.9E-09 |
-3.8E-09 |
X Variable 6 |
1.14E-09 |
1.92E-09 |
0.592423 |
0.553581 |
-2.6E-09 |
4.91E-09 |
-2.6E-09 |
4.91E-09 |
X Variable 7 |
-2.4E-08 |
3.39E-09 |
-7.09928 |
1.34E-12 |
-3.1E-08 |
-1.7E-08 |
-3.1E-08 |
-1.7E-08 |
X Variable 8 |
6.86E-12 |
8.73E-12 |
0.786271 |
0.431727 |
-1E-11 |
2.4E-11 |
-1E-11 |
2.4E-11 |
X Variable 9 |
0.000974 |
0.000332 |
2.938598 |
0.003305 |
0.000324 |
0.001624 |
0.000324 |
0.001624 |
X Variable 10 |
0.070007 |
0.013677 |
5.118475 |
3.14E-07 |
0.043197 |
0.096818 |
0.043197 |
0.096818 |
X Variable 11 |
-0.21499 |
0.003476 |
-61.8428 |
0 |
-0.22181 |
-0.20818 |
-0.22181 |
-0.20818 |