92. R Analysis Case Study
1. Read the data
> bank=read.table("bank-full.csv",header=TRUE,sep=";")
2. Inspect the data structure
> # the file is re-read here with a comma separator; this is the call that produced the output below
> bank=read.table("bank-full.csv",header=TRUE,sep=",")
> str(bank)
'data.frame':	41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
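One caveat when reproducing the output above: since R 4.0.0, read.table defaults to stringsAsFactors = FALSE, so text columns would be listed as chr rather than Factor. The following is a minimal sketch of requesting factors explicitly. It assumes a tiny stand-in file with made-up rows; the temporary file, its contents, and the name demo are illustrative and are not the real data.

```r
# Write a tiny stand-in file (hypothetical rows, not the real bank data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,job,y",
             "56,housemaid,no",
             "57,services,no",
             "37,services,yes"), tmp)

# Ask for factors explicitly so the result does not depend on the R version
demo <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(demo)
```

With the real file, the same argument can be added to the read.table call used in this step.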
3. View summary statistics
> summary(bank)
      age                 job            marital                    education
 Min.   :17.00   admin.     :10422   divorced: 4612   university.degree  :12168
 1st Qu.:32.00   blue-collar: 9254   married :24928   high.school        : 9515
 Median :38.00   technician : 6743   single  :11568   basic.9y           : 6045
 Mean   :40.02   services   : 3969   unknown :   80   professional.course: 5243
 3rd Qu.:47.00   management : 2924                    basic.4y           : 4176
 Max.   :98.00   retired    : 1720                    basic.6y           : 2292
                 (Other)    : 6156                    (Other)            : 1749
    default         housing          loan             contact          month
 no     :32588   no     :18622   no     :33950   cellular :26144   may    :13769
 unknown: 8597   unknown:  990   unknown:  990   telephone:15044   jul    : 7174
 yes    :    3   yes    :21576   yes    : 6248                     aug    : 6178
                                                                   jun    : 5318
                                                                   nov    : 4101
                                                                   apr    : 2632
                                                                   (Other): 2016
 day_of_week    duration         campaign          pdays          previous
 fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000
 mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000
 thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000
 tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173
 wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000
             Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000
        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx
 failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8
 nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7
 success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8
                     Mean   : 0.08189   Mean   :93.58   Mean   :-40.5
                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4
                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9
   euribor3m      nr.employed     y
 Min.   :0.634   Min.   :4964   no :36548
 1st Qu.:1.344   1st Qu.:5099   yes: 4640
 Median :4.857   Median :5191
 Mean   :3.621   Mean   :5167
 3rd Qu.:4.961   3rd Qu.:5228
 Max.   :5.045   Max.   :5228
> psych::describe(bank)
               vars     n    mean     sd  median trimmed    mad     min     max
age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00
job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00
marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00
education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00
default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00
housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00
loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00
contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00
month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00
day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00
duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00
campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00
pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00
previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00
poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00
emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40
cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77
cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90
euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04
nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10
y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00
                 range  skew kurtosis   se
age              81.00  0.78     0.79 0.05
job*             11.00  0.45    -1.39 0.02
marital*          3.00 -0.06    -0.34 0.00
education*        7.00 -0.24    -1.21 0.01
default*          2.00  1.44     0.07 0.00
housing*          2.00 -0.14    -1.95 0.00
loan*             2.00  1.82     1.38 0.00
contact*          1.00  0.56    -1.69 0.00
month*            9.00 -0.31    -1.03 0.01
day_of_week*      4.00  0.01    -1.27 0.01
duration       4918.00  3.26    20.24 1.28
campaign         55.00  4.76    36.97 0.01
pdays           999.00 -4.92    22.23 0.92
previous          7.00  3.83    20.11 0.00
poutcome*         2.00 -0.88     3.98 0.00
emp.var.rate      4.80 -0.72    -1.06 0.01
cons.price.idx    2.57 -0.23    -0.83 0.00
cons.conf.idx    23.90  0.30    -0.36 0.02
euribor3m         4.41 -0.71    -1.41 0.01
nr.employed     264.50 -1.04     0.00 0.36
y*                1.00  2.45     4.00 0.00
4. Check for missing values
> sapply(bank,anyNA)
           age            job        marital      education        default
         FALSE          FALSE          FALSE          FALSE          FALSE
       housing           loan        contact          month    day_of_week
         FALSE          FALSE          FALSE          FALSE          FALSE
      duration       campaign          pdays       previous       poutcome
         FALSE          FALSE          FALSE          FALSE          FALSE
  emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed
         FALSE          FALSE          FALSE          FALSE          FALSE
             y
         FALSE
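anyNA only detects true NA values. The summary output in step 3 shows that several factors carry an "unknown" level instead (for example 80 rows in marital and 990 in housing), and those rows may need to be treated as missing as well. A minimal sketch of counting them per column follows. It assumes a small made-up data frame named demo as a stand-in for bank; the rows are hypothetical.

```r
# Hypothetical miniature stand-in for the real `bank` data frame
demo <- data.frame(
  marital = factor(c("married", "single", "unknown", "married")),
  housing = factor(c("yes", "unknown", "unknown", "no")),
  age     = c(56L, 37L, 40L, 24L)
)

# Count "unknown" entries per column; numeric columns simply yield 0
unknown_counts <- sapply(demo, function(col) sum(as.character(col) == "unknown"))
unknown_counts
```

Applied to the real data, the same sapply call on bank would give the "unknown" count for each of the 21 variables.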
5. Single-variable frequency analysis
> table(bank$y)

   no   yes
36548  4640
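Raw counts are easier to judge as shares. The sketch below turns the two counts printed above into proportions; the vector y_counts is typed in by hand from that output, and the variable names are illustrative.

```r
# Counts copied from the table(bank$y) output above
y_counts <- c(no = 36548, yes = 4640)

# Convert counts to proportions and round for display
y_share <- round(prop.table(y_counts), 4)
y_share
```

Only about 11% of the rows are "yes", so the target variable is clearly imbalanced. With the data loaded, prop.table(table(bank$y)) gives the same shares directly.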
6. Cross-tabulation (contingency table) of two variables
> table(bank$y,bank$marital)

      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> xtabs(~y+marital,data=bank)
     marital
y     divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
7. Row and column proportions
> # tab is used below but never defined in the session; it is assumed to be the table from step 6
> tab=table(bank$y,bank$marital)
> prop.table(tab,1)

         divorced     married      single     unknown
  no  0.113166247 0.612783189 0.272189997 0.001860567
  yes 0.102586207 0.545689655 0.349137931 0.002586207
> prop.table(tab,2)

       divorced   married    single   unknown
  no  0.8967910 0.8984275 0.8599585 0.8500000
  yes 0.1032090 0.1015725 0.1400415 0.1500000
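The second argument of prop.table selects the margin: 1 normalizes within each row, 2 within each column. The sketch below rebuilds the table from the counts printed in step 6 (typed in by hand, so no data file is needed) and adds totals with addmargins. The names row_share and col_share are illustrative.

```r
# Rebuild the y-by-marital table from the counts shown in step 6
tab <- as.table(matrix(
  c(4136, 476, 22396, 2532, 9948, 1620, 68, 12),
  nrow = 2,
  dimnames = list(y = c("no", "yes"),
                  marital = c("divorced", "married", "single", "unknown"))
))

row_share <- prop.table(tab, 1)  # each row sums to 1
col_share <- prop.table(tab, 2)  # each column sums to 1

# Append row and column totals to the count table
addmargins(tab)
```

The column shares are the ones to read when comparing subscription rates across marital groups, since each column then sums to 1.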
8. Build more complex tables
> ftable(bank[,c(3,4,21)],row.vars = c(1,2),col.vars = "y")
                             y   no  yes
marital  education
divorced basic.4y               406   83
         basic.6y               169   13
         basic.9y               534   31
         high.school           1086  107
         illiterate               1    1
         professional.course    596   61
         university.degree     1177  160
         unknown                167   20
married  basic.4y              2915  313
         basic.6y              1628  139
         basic.9y              3858  298
         high.school           4683  475
         illiterate              12    3
         professional.course   2799  357
         university.degree     5573  821
         unknown                928  126
single   basic.4y               422   31
         basic.6y               301   36
         basic.9y              1174  142
         high.school           2702  448
         illiterate               1    0
         professional.course   1247  177
         university.degree     3723  683
         unknown                378  103
unknown  basic.4y                 5    1
         basic.6y                 6    0
         basic.9y                 6    2
         high.school             13    1
         illiterate               0    0
         professional.course      6    0
         university.degree       25    6
         unknown                  7    2
9. Chi-squared test
> tab

      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> chisq.test(tab)

	Pearson's Chi-squared test

data:  tab
X-squared = 122.66, df = 3, p-value < 2.2e-16
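The test rejects independence between y and marital, but the printed result does not show which cells drive the statistic. The object returned by chisq.test also carries the expected counts and standardized residuals. The sketch below inspects them, rebuilding the table from the counts printed above so it runs without the data file; the name res is illustrative.

```r
# Rebuild the table from the counts printed above
tab <- as.table(matrix(
  c(4136, 476, 22396, 2532, 9948, 1620, 68, 12),
  nrow = 2,
  dimnames = list(y = c("no", "yes"),
                  marital = c("divorced", "married", "single", "unknown"))
))

res <- chisq.test(tab)

# Counts expected under independence of y and marital
round(res$expected, 1)

# Standardized residuals: large absolute values mark the cells
# that deviate most from independence
round(res$stdres, 2)
```

A positive residual in the yes/single cell means single clients said "yes" more often than independence would predict, which matches the column proportions in step 7.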
10. Visualize continuous data
> hist(bank$age)
11. Distribution of a continuous variable
> library(lattice)
> densityplot(~age,groups=y,data=bank,plot.points=FALSE,auto.key = TRUE)