Assignment 1: PCA Dimensionality Reduction Exercise
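[Problem]
1. The dataset in the table below describes the economic development of 30 provinces, municipalities, and autonomous regions in mainland China using eight indicators: gross national product (A1), household consumption level (A2), fixed-asset investment (A3), average employee wage (A4), freight turnover (A5), consumer price index (A6), retail price index (A7), and gross industrial output (A8). Use basic PCA to combine these eight indicators into three composite indicators.
[Requirements]
1. Write out the main steps PCA uses to reduce dimensionality;
2. Write out the calculation for this problem in detail;
3. Complete the work directly in a blog post, or do it in your notebook and upload a photo.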
Economic development data for mainland China

Region | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 |
---|---|---|---|---|---|---|---|---|
北京 | 1394.89 | 2505 | 519.01 | 8144 | 373.9 | 117.3 | 112.6 | 843.43 |
天津 | 920.11 | 2720 | 345.46 | 6501 | 342.8 | 115.2 | 110.6 | 582.51 |
河北 | 2849.52 | 1258 | 704.87 | 4839 | 2033.3 | 115.2 | 115.8 | 1234.85 |
山西 | 1092.48 | 1250 | 290.9 | 4721 | 717.3 | 116.9 | 115.6 | 697.25 |
内蒙古 | 832.88 | 1387 | 250.23 | 4134 | 781.7 | 117.5 | 116.8 | 419.39 |
辽宁 | 2793.37 | 2397 | 387.99 | 4911 | 1371.1 | 116.1 | 114 | 1840.55 |
吉林 | 1129.2 | 1872 | 320.45 | 4430 | 497.4 | 115.2 | 114.2 | 762.47 |
黑龙江 | 2014.53 | 2334 | 435.73 | 4145 | 824.8 | 116.1 | 114.3 | 1240.37 |
上海 | 2462.57 | 5343 | 996.48 | 9279 | 207.4 | 118.7 | 113 | 1642.95 |
江苏 | 5155.25 | 1926 | 1434.95 | 5934 | 1025.5 | 115.8 | 114.3 | 2026.64 |
浙江 | 3524.79 | 2249 | 1006.39 | 6619 | 754.4 | 116.6 | 113.5 | 916.59 |
安徽 | 2003.58 | 1254 | 474 | 4609 | 908.3 | 114.8 | 112.7 | 824.14 |
福建 | 2160.52 | 2320 | 553.97 | 5857 | 609.3 | 115.2 | 114.4 | 433.67 |
江西 | 1205.1 | 1182 | 282.84 | 4211 | 411.7 | 116.9 | 115.9 | 571.84 |
山东 | 5002.34 | 1527 | 1229.55 | 5145 | 1196.6 | 117.6 | 114.2 | 2207.69 |
河南 | 3002.74 | 1034 | 670.35 | 4344 | 1574.4 | 116.5 | 114.9 | 1367.92 |
湖北 | 2391.42 | 1527 | 571.68 | 4685 | 849 | 120 | 116.6 | 1220.72 |
湖南 | 2195.7 | 1408 | 422.61 | 4797 | 1011.8 | 119 | 115.5 | 843.83 |
广东 | 5381.72 | 2699 | 1639.83 | 8250 | 656.5 | 114 | 111.6 | 1396.35 |
广西 | 1606.15 | 1314 | 382.59 | 5150 | 556 | 118.4 | 116.4 | 554.97 |
海南 | 364.17 | 1814 | 198.35 | 5340 | 232.1 | 113.5 | 111.3 | 64.33 |
四川 | 3534 | 1261 | 822.54 | 4645 | 902.3 | 118.5 | 117 | 1431.81 |
贵州 | 630.07 | 942 | 150.84 | 4475 | 301.1 | 121.4 | 117.2 | 324.72 |
云南 | 1206.68 | 1261 | 334 | 5149 | 310.4 | 121.3 | 118.1 | 716.65 |
西藏 | 55.98 | 1110 | 17.87 | 7382 | 4.2 | 117.3 | 114.9 | 5.57 |
陕西 | 1000.03 | 1208 | 300.27 | 4396 | 500.9 | 119 | 117 | 600.98 |
甘肃 | 553.35 | 1007 | 114.81 | 5493 | 507 | 119.8 | 116.5 | 468.79 |
青海 | 165.31 | 1445 | 47.76 | 5753 | 61.6 | 118 | 116.3 | 105.8 |
宁夏 | 169.75 | 1355 | 61.98 | 5079 | 121.8 | 117.1 | 115.3 | 114.4 |
新疆 | 834.57 | 1469 | 376.95 | 5348 | 339 | 119.7 | 116.7 | 428.76 |
Solution steps:
- Import the data:
import pandas as pd
df = pd.read_excel("../data/我国大陆经济发展状况数据.xlsx")
df
我国大陆经济发展状况数据 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | NaN | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 |
1 | 北京 | 1394.89 | 2505 | 519.01 | 8144 | 373.9 | 117.3 | 112.6 | 843.43 |
2 | 天津 | 920.11 | 2720 | 345.46 | 6501 | 342.8 | 115.2 | 110.6 | 582.51 |
3 | 河北 | 2849.52 | 1258 | 704.87 | 4839 | 2033.3 | 115.2 | 115.8 | 1234.85 |
4 | 山西 | 1092.48 | 1250 | 290.9 | 4721 | 717.3 | 116.9 | 115.6 | 697.25 |
5 | 内蒙古 | 832.88 | 1387 | 250.23 | 4134 | 781.7 | 117.5 | 116.8 | 419.39 |
6 | 辽宁 | 2793.37 | 2397 | 387.99 | 4911 | 1371.1 | 116.1 | 114 | 1840.55 |
7 | 吉林 | 1129.2 | 1872 | 320.45 | 4430 | 497.4 | 115.2 | 114.2 | 762.47 |
8 | 黑龙江 | 2014.53 | 2334 | 435.73 | 4145 | 824.8 | 116.1 | 114.3 | 1240.37 |
9 | 上海 | 2462.57 | 5343 | 996.48 | 9279 | 207.4 | 118.7 | 113 | 1642.95 |
10 | 江苏 | 5155.25 | 1926 | 1434.95 | 5934 | 1025.5 | 115.8 | 114.3 | 2026.64 |
11 | 浙江 | 3524.79 | 2249 | 1006.39 | 6619 | 754.4 | 116.6 | 113.5 | 916.59 |
12 | 安徽 | 2003.58 | 1254 | 474 | 4609 | 908.3 | 114.8 | 112.7 | 824.14 |
13 | 福建 | 2160.52 | 2320 | 553.97 | 5857 | 609.3 | 115.2 | 114.4 | 433.67 |
14 | 江西 | 1205.1 | 1182 | 282.84 | 4211 | 411.7 | 116.9 | 115.9 | 571.84 |
15 | 山东 | 5002.34 | 1527 | 1229.55 | 5145 | 1196.6 | 117.6 | 114.2 | 2207.69 |
16 | 河南 | 3002.74 | 1034 | 670.35 | 4344 | 1574.4 | 116.5 | 114.9 | 1367.92 |
17 | 湖北 | 2391.42 | 1527 | 571.68 | 4685 | 849 | 120 | 116.6 | 1220.72 |
18 | 湖南 | 2195.7 | 1408 | 422.61 | 4797 | 1011.8 | 119 | 115.5 | 843.83 |
19 | 广东 | 5381.72 | 2699 | 1639.83 | 8250 | 656.5 | 114 | 111.6 | 1396.35 |
20 | 广西 | 1606.15 | 1314 | 382.59 | 5150 | 556 | 118.4 | 116.4 | 554.97 |
21 | 海南 | 364.17 | 1814 | 198.35 | 5340 | 232.1 | 113.5 | 111.3 | 64.33 |
22 | 四川 | 3534 | 1261 | 822.54 | 4645 | 902.3 | 118.5 | 117 | 1431.81 |
23 | 贵州 | 630.07 | 942 | 150.84 | 4475 | 301.1 | 121.4 | 117.2 | 324.72 |
24 | 云南 | 1206.68 | 1261 | 334 | 5149 | 310.4 | 121.3 | 118.1 | 716.65 |
25 | 西藏 | 55.98 | 1110 | 17.87 | 7382 | 4.2 | 117.3 | 114.9 | 5.57 |
26 | 陕西 | 1000.03 | 1208 | 300.27 | 4396 | 500.9 | 119 | 117 | 600.98 |
27 | 甘肃 | 553.35 | 1007 | 114.81 | 5493 | 507 | 119.8 | 116.5 | 468.79 |
28 | 青海 | 165.31 | 1445 | 47.76 | 5753 | 61.6 | 118 | 116.3 | 105.8 |
29 | 宁夏 | 169.75 | 1355 | 61.98 | 5079 | 121.8 | 117.1 | 115.3 | 114.4 |
30 | 新疆 | 834.57 | 1469 | 376.95 | 5348 | 339 | 119.7 | 116.7 | 428.76 |
- Preprocess the data by converting the dataset into a matrix:
import numpy as np
data = df.values
data = np.delete(data, 0, axis=0) # drop the first row (the A1..A8 labels)
data = np.delete(data, 0, axis=1) # drop the first column (the region names)
print(data)
[[1394.89 2505 519.01 8144 373.9 117.3 112.6 843.43]
[920.11 2720 345.46 6501 342.8 115.2 110.6 582.51]
[2849.52 1258 704.87 4839 2033.3 115.2 115.8 1234.85]
[1092.48 1250 290.9 4721 717.3 116.9 115.6 697.25]
[832.88 1387 250.23 4134 781.7 117.5 116.8 419.39]
[2793.37 2397 387.99 4911 1371.1 116.1 114 1840.55]
[1129.2 1872 320.45 4430 497.4 115.2 114.2 762.47]
[2014.53 2334 435.73 4145 824.8 116.1 114.3 1240.37]
[2462.57 5343 996.48 9279 207.4 118.7 113 1642.95]
[5155.25 1926 1434.95 5934 1025.5 115.8 114.3 2026.64]
[3524.79 2249 1006.39 6619 754.4 116.6 113.5 916.59]
[2003.58 1254 474 4609 908.3 114.8 112.7 824.14]
[2160.52 2320 553.97 5857 609.3 115.2 114.4 433.67]
[1205.1 1182 282.84 4211 411.7 116.9 115.9 571.84]
[5002.34 1527 1229.55 5145 1196.6 117.6 114.2 2207.69]
[3002.74 1034 670.35 4344 1574.4 116.5 114.9 1367.92]
[2391.42 1527 571.68 4685 849 120 116.6 1220.72]
[2195.7 1408 422.61 4797 1011.8 119 115.5 843.83]
[5381.72 2699 1639.83 8250 656.5 114 111.6 1396.35]
[1606.15 1314 382.59 5150 556 118.4 116.4 554.97]
[364.17 1814 198.35 5340 232.1 113.5 111.3 64.33]
[3534 1261 822.54 4645 902.3 118.5 117 1431.81]
[630.07 942 150.84 4475 301.1 121.4 117.2 324.72]
[1206.68 1261 334 5149 310.4 121.3 118.1 716.65]
[55.98 1110 17.87 7382 4.2 117.3 114.9 5.57]
...
[553.35 1007 114.81 5493 507 119.8 116.5 468.79]
[165.31 1445 47.76 5753 61.6 118 116.3 105.8]
[169.75 1355 61.98 5079 121.8 117.1 115.3 114.4]
[834.57 1469 376.95 5348 339 119.7 116.7 428.76]]
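The matrix printed above still has dtype `object`, because `df.values` mixes the label row and the region column with the numbers. The following is a minimal sketch of one way to get a plain float matrix instead, assuming the same layout (labels in the first row, region names in the first column). The array `raw` is a hand-built stand-in for `df.values` with only three regions, and `data_float` is a name introduced here for illustration.

```python
import numpy as np

# Hand-built stand-in for df.values: label row first, region names in column 0.
# Only three of the 30 regions are reproduced here.
raw = np.array([
    [None, "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"],
    ["北京", 1394.89, 2505, 519.01, 8144, 373.9, 117.3, 112.6, 843.43],
    ["天津", 920.11, 2720, 345.46, 6501, 342.8, 115.2, 110.6, 582.51],
    ["河北", 2849.52, 1258, 704.87, 4839, 2033.3, 115.2, 115.8, 1234.85],
], dtype=object)

# Slice away the label row and the region column, then convert to float.
data_float = raw[1:, 1:].astype(float)

print(data_float.dtype)   # float64
print(data_float.shape)   # (3, 8)
```

With a float matrix, later steps such as the eigendecomposition would not need a separate `astype(float)` call.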
- Compute the mean of each attribute:
# Compute the mean of each attribute (column)
MEAN = np.mean(data, axis=0)
print(MEAN)
[1921.0923333333333 1745.9333333333334 511.5083333333334 5458.833333333333
666.1199999999998 117.28666666666668 114.9066666666667 862.9980000000003]
- Center the data:
# Center the data by subtracting each column's mean
X = np.subtract(data, MEAN)
print(X)
[[-526.2023333333332 759.0666666666666 7.501666666666608
2685.166666666667 -292.2199999999998 0.013333333333321207
-2.3066666666667004 -19.568000000000325]
[-1000.9823333333333 974.0666666666666 -166.0483333333334
1042.166666666667 -323.31999999999977 -2.086666666666673
-4.3066666666667 -280.4880000000003]
[928.4276666666667 -487.9333333333334 193.36166666666662
-619.833333333333 1367.1800000000003 -2.086666666666673
0.8933333333333024 371.85199999999963]
[-828.6123333333333 -495.9333333333334 -220.6083333333334
-737.833333333333 51.18000000000018 -0.38666666666667027
0.6933333333332996 -165.74800000000027]
[-1088.2123333333334 -358.9333333333334 -261.2783333333334
-1324.833333333333 115.58000000000027 0.21333333333332405
1.8933333333333024 -443.6080000000003]
[872.2776666666666 651.0666666666666 -123.51833333333337
-547.833333333333 704.9800000000001 -1.1866666666666816
-0.9066666666666947 977.5519999999997]
[-791.8923333333332 126.0666666666666 -191.0583333333334
-1028.833333333333 -168.7199999999998 -2.086666666666673
-0.7066666666666919 -100.52800000000025]
[93.4376666666667 588.0666666666666 -75.77833333333336
-1313.833333333333 158.68000000000018 -1.1866666666666816
-0.6066666666666976 377.3719999999996]
[541.4776666666669 3597.0666666666666 484.97166666666664
...
0.39333333333330245 -748.5980000000003]
[-1086.5223333333333 -276.9333333333334 -134.5583333333334
-110.83333333333303 -327.1199999999998 2.413333333333327
1.7933333333333081 -434.2380000000003]]
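A quick sanity check on the centering step is that every column of the centered matrix should average to zero. The sketch below uses only the first four regions and the columns A1 to A3 copied from the table, so the numbers differ from the full 30-row result; `sample`, `col_mean`, and `centered` are illustrative names.

```python
import numpy as np

# First four regions, columns A1..A3 only, copied from the table above.
sample = np.array([
    [1394.89, 2505.0, 519.01],
    [920.11, 2720.0, 345.46],
    [2849.52, 1258.0, 704.87],
    [1092.48, 1250.0, 290.90],
])

col_mean = sample.mean(axis=0)   # one mean per column (attribute)
centered = sample - col_mean     # broadcasting subtracts it from every row

# After centering, each column mean is zero up to floating-point rounding.
print(np.allclose(centered.mean(axis=0), 0.0))
```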
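- Compute the covariance matrix:
# Compute X^T X (the scatter matrix), which is (n - 1) times the sample covariance matrix;
# its eigenvectors are the same, and its eigenvalues are scaled by (n - 1)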
COV = np.dot(X.T, X)
COV = COV.astype(float)
print(COV)
[[ 6.30765463e+07 9.83079456e+06 1.63796766e+07 1.06363082e+07
1.21417847e+07 -2.36094691e+04 -2.14014755e+04 2.18457210e+07]
[ 9.83079456e+06 2.15303779e+07 4.28998294e+06 2.34900917e+07
-1.73553726e+06 -1.18998267e+04 -2.81121867e+04 5.30394750e+06]
[ 1.63796766e+07 4.28998294e+06 4.70718451e+06 6.10345105e+06
2.31405602e+06 -6.63718167e+03 -7.96255167e+03 5.40851284e+06]
[ 1.06363082e+07 2.34900917e+07 6.10345105e+06 4.97450582e+07
-6.22234320e+06 -1.03221667e+04 -3.88115667e+04 2.29421764e+06]
[ 1.21417847e+07 -1.73553726e+06 2.31405602e+06 -6.22234320e+06
6.13467287e+06 -6.83924200e+03 5.49936000e+02 5.13511652e+06]
[-2.36094691e+04 -1.18998267e+04 -6.63718167e+03 -1.03221667e+04
-6.83924200e+03 1.18954667e+02 8.50426667e+01 -4.29935280e+03]
[-2.14014755e+04 -2.81121867e+04 -7.96255167e+03 -3.88115667e+04
5.49936000e+02 8.50426667e+01 1.04478667e+02 -6.18060660e+03]
[ 2.18457210e+07 5.30394750e+06 5.40851284e+06 2.29421764e+06
5.13511652e+06 -4.29935280e+03 -6.18060660e+03 9.91052562e+06]]
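The product of the centered matrix with its own transpose is the unnormalized scatter matrix; dividing by `n - 1` gives the sample covariance. The sketch below checks this against `np.cov` on synthetic data, which stands in for the real 30 x 8 table; `demo`, `Xc`, `scatter`, and `cov` are names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 30 x 8 data matrix (not the real data).
demo = rng.normal(size=(30, 8))
Xc = demo - demo.mean(axis=0)

scatter = Xc.T @ Xc                  # same quantity as np.dot(X.T, X)
cov = scatter / (Xc.shape[0] - 1)    # sample covariance

# np.cov treats rows as variables by default, hence rowvar=False.
print(np.allclose(cov, np.cov(demo, rowvar=False)))
```

Because the two matrices differ only by a positive constant, the choice does not change the principal directions or the projected data, only the scale of the eigenvalues.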
- Compute the eigenvalues and eigenvectors:
# Compute the eigenvalues and eigenvectors
W, V = np.linalg.eig(COV)
print("Eigenvalues:", W) # print the eigenvalues
print()
print("Eigenvectors:", V) # print the eigenvectors (one per column of V)
Eigenvalues: [8.72783416e+07 5.53770279e+07 8.72267037e+06 2.38700596e+06
1.18531002e+06 1.54098280e+05 1.22524030e+02 1.21054604e+01]
Eigenvectors: [[ 7.60864600e-01 4.66938137e-01 -1.80373006e-01 2.52020961e-01
-1.52569263e-01 2.89350422e-01 1.94787783e-03 -9.40837517e-05]
[ 3.05787140e-01 -3.59247488e-01 8.36880703e-01 1.43107263e-01
-2.32541833e-01 5.00622299e-02 2.12804289e-03 8.90168180e-05]
[ 2.20032129e-01 6.08786086e-02 -6.32704760e-02 1.91235768e-01
-1.18225330e-01 -9.45156605e-01 -2.59674515e-03 -6.36947361e-04]
[ 4.43287300e-01 -7.46702042e-01 -4.23745871e-01 -2.48236950e-01
6.41421978e-02 2.52174669e-02 6.26941643e-05 3.69984670e-04]
[ 9.67267122e-02 2.43733428e-01 9.58250372e-02 -8.47265778e-01
-4.43979074e-01 -8.40975648e-02 2.38636532e-03 -8.41964969e-04]
[-3.37588559e-04 -3.40297392e-05 -3.13444459e-04 3.32636447e-04
4.98863558e-03 -2.30072299e-03 7.69626617e-01 -6.38470333e-01]
[-5.20827327e-04 4.98878922e-04 -4.98576575e-04 7.67414488e-05
2.37283057e-03 -2.60981247e-03 6.38461858e-01 7.69644879e-01]
[ 2.70749099e-01 1.79537294e-01 2.72671535e-01 -3.16154438e-01
8.41075223e-01 -1.12821419e-01 -4.53446883e-03 1.06096931e-03]]
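One way to judge whether three components are enough is the cumulative share of the total variance they carry. The sketch below reuses the eigenvalues printed above; `eigvals`, `ratio`, and `cumulative` are illustrative names.

```python
import numpy as np

# Eigenvalues copied from the output above.
eigvals = np.array([
    8.72783416e+07, 5.53770279e+07, 8.72267037e+06, 2.38700596e+06,
    1.18531002e+06, 1.54098280e+05, 1.22524030e+02, 1.21054604e+01,
])

order = np.argsort(eigvals)[::-1]        # indices in descending order
ratio = eigvals[order] / eigvals.sum()   # share of total variance per component
cumulative = np.cumsum(ratio)            # running total of those shares

print(np.round(cumulative[:3], 4))
```

On these values the first three components account for about 97.6% of the total variance of the raw, unstandardized data.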
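- Reduce the dimensionality:
# Reduce the dimensionality
indexs = np.argsort(-W) # indices that sort the eigenvalues in descending order
K = 3
V1 = np.matrix(V.T[indexs[:K]]) # each row of V1 is one of the top-K eigenvectors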
print(V1)
[[ 7.60864600e-01 3.05787140e-01 2.20032129e-01 4.43287300e-01
9.67267122e-02 -3.37588559e-04 -5.20827327e-04 2.70749099e-01]
[ 4.66938137e-01 -3.59247488e-01 6.08786086e-02 -7.46702042e-01
2.43733428e-01 -3.40297392e-05 4.98878922e-04 1.79537294e-01]
[-1.80373006e-01 8.36880703e-01 -6.32704760e-02 -4.23745871e-01
9.58250372e-02 -3.13444459e-04 -4.98576575e-04 2.72671535e-01]]
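# Project the centered samples onto the selected eigenvectors to obtain the reduced data
result = X * V1.T # V1 is an np.matrix, so * performs matrix multiplication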
print(result)
The matrix after dimensionality reduction:
[[990.1326863340707 -2597.69759168907 -441.4784642366912]
[-145.52428880114613 -1734.7886246637538 457.15948019037387]
[557.9072003283591 1483.398801571934 -92.98380961850476]
[-1197.6507742982735 311.4801215166848 20.74342271542782]
[-1691.4387100947454 542.6950193610462 363.9387813111703]
[925.6100443838645 922.2881101659871 961.5931259273799]
[-1105.617524618663 282.37521039454055 652.8132720465425]
[-230.64022211221783 915.2245965382683 1154.9187887685716]
[3478.86997429049 -3834.178366714535 1431.8978413738632]
[3279.4578222546725 1443.3827854361957 -340.7065062222017]
[2020.2551020832186 -236.93107558341126 -368.1108525884341]
[-459.7318087774081 899.5811670312297 -51.46820908417876]
[421.8241854938795 -480.09045763726107 143.32157982061096]
[-1424.120194646775 671.8223253398307 96.66226328722492]
[2711.7251771700553 2166.1743729893374 -233.9474910773763]
[370.6085563718533 1914.9717455683362 -103.83320294200522]
[75.65837706955209 988.5504337462078 171.10948851404976]
[-179.09333250231623 819.219426337683 -18.370515308614152]
[4553.537265703688 -648.5566444932142 -936.2247612478483]
[-631.0272038053428 148.7287586475032 -260.1898164907052]
[-1543.592191645697 -930.9482886476302 148.59822439208742]
[963.441699582559 1713.6543013995047 -193.8475843803448]
[-1924.6531609798114 213.04929210879607 -181.96932089735466]
[-942.2943832474804 -51.80288258527128 -208.44368276204676]
[-1165.8323234636473 -2423.791455703293 -1276.710339435131]
...
[-1460.8880574343148 -532.4110672961071 -483.81250632244604]
[-1663.0582266802576 -1242.905733280883 -294.85530675652006]
[-1974.695399075536 -688.1410791022479 -78.15599185023306]
[-1139.3297364851903 -490.9745399058489 -130.0544055313741]]
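The steps above can be gathered into one function. The following is a minimal sketch, not the assignment's reference solution: `pca_project` and its `standardize` flag are hypothetical names introduced here, it uses `np.linalg.eigh` (intended for symmetric matrices) in place of `np.linalg.eig`, and it runs on synthetic data rather than the real table.

```python
import numpy as np

def pca_project(data, k, standardize=False):
    """Project the rows of `data` onto its top-k principal components."""
    data = np.asarray(data, dtype=float)
    X = data - data.mean(axis=0)              # center each column
    if standardize:
        X = X / data.std(axis=0, ddof=1)      # rescale each column to unit variance
    cov = X.T @ X / (X.shape[0] - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest eigenvalues
    components = eigvecs[:, order]            # columns are the chosen eigenvectors
    return X @ components, eigvals[order]

# Synthetic stand-in for the 30 x 8 table (not the real data); the column
# scales are made up to mimic indicators of very different magnitude.
rng = np.random.default_rng(0)
demo = rng.normal(size=(30, 8)) * np.array([1000, 1000, 300, 1500, 400, 2, 2, 500])

scores, top_eigvals = pca_project(demo, k=3)
print(scores.shape)   # (30, 3)
```

Eigenvector signs are arbitrary, so a column of `scores` may come out with the opposite sign from a run that uses `np.linalg.eig`. In the eigenvectors printed earlier, the loadings on A6 and A7 are on the order of 1e-4, which suggests the large-valued indicators dominate the raw-data result; passing `standardize=True` is one way to put all eight indicators on a common scale before projecting.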