92. R Language Analysis Case Study

1. Reading the data

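> # sep must match the delimiter of your copy of the file; the str() output in step 2 was produced with sep=","
> bank=read.table("bank-full.csv",header=TRUE,sep=",")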
> 
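
The factor columns in the str() output of the next step depend on how text columns are read. Since R 4.0.0 the default is stringsAsFactors=FALSE, so a current R session would show chr columns unless the argument is set explicitly. The sketch below assumes a tiny made-up file (not bank-full.csv) so that it runs on its own; the name demo is only for illustration.

```r
# Write a tiny comma-separated file so the example is self-contained.
# The three rows are made up for illustration; they are not from bank-full.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,job,y",
             "56,housemaid,no",
             "57,services,no",
             "37,admin.,yes"), tmp)

# stringsAsFactors = TRUE turns text columns into factors on any R version
# (the default changed to FALSE in R 4.0.0).
demo <- read.table(tmp, header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(demo)

# Remove the temporary file again
unlink(tmp)
```

With the real file, the same argument can be added to the read.table() call above.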

2. Inspecting the data structure

> bank=read.table("bank-full.csv",header=TRUE,sep=",")
> str(bank)
'data.frame':    41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

3. Summary statistics

> summary(bank)
      age                 job            marital                    education    
 Min.   :17.00   admin.     :10422   divorced: 4612   university.degree  :12168  
 1st Qu.:32.00   blue-collar: 9254   married :24928   high.school        : 9515  
 Median :38.00   technician : 6743   single  :11568   basic.9y           : 6045  
 Mean   :40.02   services   : 3969   unknown :   80   professional.course: 5243  
 3rd Qu.:47.00   management : 2924                    basic.4y           : 4176  
 Max.   :98.00   retired    : 1720                    basic.6y           : 2292  
                 (Other)    : 6156                    (Other)            : 1749  
    default         housing           loan            contact          month      
 no     :32588   no     :18622   no     :33950   cellular :26144   may    :13769  
 unknown: 8597   unknown:  990   unknown:  990   telephone:15044   jul    : 7174  
 yes    :    3   yes    :21576   yes    : 6248                     aug    : 6178  
                                                                   jun    : 5318  
                                                                   nov    : 4101  
                                                                   apr    : 2632  
                                                                   (Other): 2016  
 day_of_week    duration         campaign          pdays          previous    
 fri:7827    Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.000  
 mon:8514    1st Qu.: 102.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000  
 thu:8623    Median : 180.0   Median : 2.000   Median :999.0   Median :0.000  
 tue:8090    Mean   : 258.3   Mean   : 2.568   Mean   :962.5   Mean   :0.173  
 wed:8134    3rd Qu.: 319.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000  
             Max.   :4918.0   Max.   :56.000   Max.   :999.0   Max.   :7.000  
                                                                              
        poutcome      emp.var.rate      cons.price.idx  cons.conf.idx  
 failure    : 4252   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
 nonexistent:35563   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
 success    : 1373   Median : 1.10000   Median :93.75   Median :-41.8  
                     Mean   : 0.08189   Mean   :93.58   Mean   :-40.5  
                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
                                                                       
   euribor3m      nr.employed     y        
 Min.   :0.634   Min.   :4964   no :36548  
 1st Qu.:1.344   1st Qu.:5099   yes: 4640  
 Median :4.857   Median :5191              
 Mean   :3.621   Mean   :5167              
 3rd Qu.:4.961   3rd Qu.:5228              
 Max.   :5.045   Max.   :5228             
> psych::describe(bank)
               vars     n    mean     sd  median trimmed    mad     min     max
age               1 41188   40.02  10.42   38.00   39.30  10.38   17.00   98.00
job*              2 41188    4.72   3.59    3.00    4.48   2.97    1.00   12.00
marital*          3 41188    2.17   0.61    2.00    2.21   0.00    1.00    4.00
education*        4 41188    4.75   2.14    4.00    4.88   2.97    1.00    8.00
default*          5 41188    1.21   0.41    1.00    1.14   0.00    1.00    3.00
housing*          6 41188    2.07   0.99    3.00    2.09   0.00    1.00    3.00
loan*             7 41188    1.33   0.72    1.00    1.16   0.00    1.00    3.00
contact*          8 41188    1.37   0.48    1.00    1.33   0.00    1.00    2.00
month*            9 41188    5.23   2.32    5.00    5.31   2.97    1.00   10.00
day_of_week*     10 41188    3.00   1.40    3.00    3.01   1.48    1.00    5.00
duration         11 41188  258.29 259.28  180.00  210.61 139.36    0.00 4918.00
campaign         12 41188    2.57   2.77    2.00    1.99   1.48    1.00   56.00
pdays            13 41188  962.48 186.91  999.00  999.00   0.00    0.00  999.00
previous         14 41188    0.17   0.49    0.00    0.05   0.00    0.00    7.00
poutcome*        15 41188    1.93   0.36    2.00    2.00   0.00    1.00    3.00
emp.var.rate     16 41188    0.08   1.57    1.10    0.27   0.44   -3.40    1.40
cons.price.idx   17 41188   93.58   0.58   93.75   93.58   0.56   92.20   94.77
cons.conf.idx    18 41188  -40.50   4.63  -41.80  -40.60   6.52  -50.80  -26.90
euribor3m        19 41188    3.62   1.73    4.86    3.81   0.16    0.63    5.04
nr.employed      20 41188 5167.04  72.25 5191.00 5178.43  55.00 4963.60 5228.10
y*               21 41188    1.11   0.32    1.00    1.02   0.00    1.00    2.00
                 range  skew kurtosis   se
age              81.00  0.78     0.79 0.05
job*             11.00  0.45    -1.39 0.02
marital*          3.00 -0.06    -0.34 0.00
education*        7.00 -0.24    -1.21 0.01
default*          2.00  1.44     0.07 0.00
housing*          2.00 -0.14    -1.95 0.00
loan*             2.00  1.82     1.38 0.00
contact*          1.00  0.56    -1.69 0.00
month*            9.00 -0.31    -1.03 0.01
day_of_week*      4.00  0.01    -1.27 0.01
duration       4918.00  3.26    20.24 1.28
campaign         55.00  4.76    36.97 0.01
pdays           999.00 -4.92    22.23 0.92
previous          7.00  3.83    20.11 0.00
poutcome*         2.00 -0.88     3.98 0.00
emp.var.rate      4.80 -0.72    -1.06 0.01
cons.price.idx    2.57 -0.23    -0.83 0.00
cons.conf.idx    23.90  0.30    -0.36 0.02
euribor3m         4.41 -0.71    -1.41 0.01
nr.employed     264.50 -1.04     0.00 0.36
y*                1.00  2.45     4.00 0.00

4. Checking for missing values

> sapply(bank,anyNA)
           age            job        marital      education        default 
         FALSE          FALSE          FALSE          FALSE          FALSE 
       housing           loan        contact          month    day_of_week 
         FALSE          FALSE          FALSE          FALSE          FALSE 
      duration       campaign          pdays       previous       poutcome 
         FALSE          FALSE          FALSE          FALSE          FALSE 
  emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
         FALSE          FALSE          FALSE          FALSE          FALSE 
             y 
         FALSE 
> 
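
anyNA() returns FALSE everywhere, but the summary in step 3 shows an "unknown" level in several factors (for example default, housing and loan). Those entries are ordinary factor levels, not NA, so they need to be counted separately. A minimal sketch, assuming a made-up data frame demo that mimics this coding:

```r
# Toy data frame (made-up values) that mimics the coding used in bank
demo <- data.frame(
  default = factor(c("no", "unknown", "no", "yes")),
  housing = factor(c("yes", "no", "unknown", "unknown")),
  age     = c(56L, 57L, 37L, 40L)
)

# anyNA() finds nothing, because "unknown" is an ordinary factor level
na_flags <- sapply(demo, anyNA)
na_flags

# Count the "unknown" entries in each column instead
unknown_counts <- sapply(demo, function(x) sum(x == "unknown"))
unknown_counts
```

Applied to bank, the same sapply() call would list how many "unknown" values each column holds.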

5. Single-variable frequency analysis

> table(bank$y)

   no   yes 
36548  4640 
> 

6. Two-variable cross-tabulation (contingency table)

> table(bank$y,bank$marital)
     
      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> 

> xtabs(~y+marital,data=bank)
     marital
y     divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> 

7. Row and column proportions

> tab=table(bank$y,bank$marital)
> prop.table(tab,1)
     
         divorced     married      single     unknown
  no  0.113166247 0.612783189 0.272189997 0.001860567
  yes 0.102586207 0.545689655 0.349137931 0.002586207
> prop.table(tab,2)
     
       divorced   married    single   unknown
  no  0.8967910 0.8984275 0.8599585 0.8500000
  yes 0.1032090 0.1015725 0.1400415 0.1500000
> 
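
The proportions are often easier to read as percentages next to the marginal totals. The sketch below rebuilds the table from the counts printed in step 6, so it runs without bank-full.csv; the name col_pct is only for illustration.

```r
# Rebuild the y x marital table from the counts printed in step 6
tab <- as.table(matrix(
  c(4136, 22396, 9948, 68,
     476,  2532, 1620, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(y = c("no", "yes"),
                  marital = c("divorced", "married", "single", "unknown"))
))

# Add row and column totals
with_totals <- addmargins(tab)
with_totals

# Column percentages: share of "no" and "yes" within each marital status
col_pct <- round(100 * prop.table(tab, margin = 2), 1)
col_pct
```

margin = 1 normalises each row to sum to 1, margin = 2 each column.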

8. Building a more complex table

> ftable(bank[,c(3,4,21)],row.vars = c(1,2),col.vars = "y")
                             y   no  yes
marital  education                      
divorced basic.4y               406   83
         basic.6y               169   13
         basic.9y               534   31
         high.school           1086  107
         illiterate               1    1
         professional.course    596   61
         university.degree     1177  160
         unknown                167   20
married  basic.4y              2915  313
         basic.6y              1628  139
         basic.9y              3858  298
         high.school           4683  475
         illiterate              12    3
         professional.course   2799  357
         university.degree     5573  821
         unknown                928  126
single   basic.4y               422   31
         basic.6y               301   36
         basic.9y              1174  142
         high.school           2702  448
         illiterate               1    0
         professional.course   1247  177
         university.degree     3723  683
         unknown                378  103
unknown  basic.4y                 5    1
         basic.6y                 6    0
         basic.9y                 6    2
         high.school             13    1
         illiterate               0    0
         professional.course      6    0
         university.degree       25    6
         unknown                  7    2
> 

9. Chi-squared test

> tab
     
      divorced married single unknown
  no      4136   22396   9948      68
  yes      476    2532   1620      12
> chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 122.66, df = 3, p-value < 2.2e-16

> 
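
To see where the statistic comes from, it can be recomputed by hand: the expected count of each cell under independence is (row total × column total) / grand total, and X-squared is the sum of (observed − expected)² / expected over all cells. A minimal sketch, rebuilding tab from the counts printed above so that it runs on its own (the names expected, x2, df and p are only for illustration):

```r
# Observed counts, copied from the table printed above
tab <- as.table(matrix(
  c(4136, 22396, 9948, 68,
     476,  2532, 1620, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(y = c("no", "yes"),
                  marital = c("divorced", "married", "single", "unknown"))
))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic, degrees of freedom and upper-tail p-value
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p  <- pchisq(x2, df, lower.tail = FALSE)

# Compare with the built-in test
res <- chisq.test(tab)
c(by_hand = x2, built_in = unname(res$statistic), df = df)
```

The smallest expected count here (the yes/unknown cell) is about 9, so the usual rule of thumb of at least 5 expected observations per cell is met.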

10. Visualizing continuous data

> hist(bank$age)
> 

11. Distribution of a continuous variable

> library(lattice)
> densityplot(~age,groups=y,data=bank,plot.point=FALSE,auto.key = TRUE)
> 
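
If lattice is not available, a similar grouped density plot can be drawn with base R. This is only a sketch under stated assumptions: the ages are simulated for illustration and are not taken from bank-full.csv, and the names demo and dens are hypothetical.

```r
# Simulated ages, for illustration only (not taken from bank-full.csv)
set.seed(1)
demo <- data.frame(
  age = c(round(rnorm(300, mean = 40, sd = 10)),
          round(rnorm(100, mean = 42, sd = 14))),
  y   = factor(rep(c("no", "yes"), times = c(300, 100)))
)

# One kernel density estimate per level of y
dens <- lapply(split(demo$age, demo$y), density)

# Common vertical range so that both curves fit in the plot
y_range <- range(sapply(dens, function(d) range(d$y)))

# Draw the first curve, then overlay the second on the same axes
plot(dens$no, ylim = y_range, main = "Age by outcome", xlab = "age")
lines(dens$yes, lty = 2)
legend("topright", legend = names(dens), lty = 1:2)
```

With the real data, split(bank$age, bank$y) would take the place of the simulated columns.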

posted @ 2017-06-02 15:39  香港胖仔