R Data Preprocessing (Part 2)

1. Data transformation

Centering and scaling the raw data:

> summary(sim.dat1)
      age           gender        income       house       store_exp       online_exp     
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :155.8   Min.   :  68.82  
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:205.1   1st Qu.: 420.34  
 Median :36.00                Median : 93869             Median :329.8   Median :1941.86  
 Mean   :38.58                Mean   :109923             Mean   :373.1   Mean   :2120.18  
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:597.2   3rd Qu.:2440.78  
 Max.   :69.00                Max.   :319704             Max.   :597.3   Max.   :9479.44  
  store_trans     online_trans         Q1              Q2              Q3       
 Min.   : 1.00   Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 3.00   1st Qu.: 6.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 4.00   Median :14.00   Median :3.000   Median :1.000   Median :1.000  
 Mean   : 5.35   Mean   :13.55   Mean   :3.101   Mean   :1.823   Mean   :1.992  
 3rd Qu.: 7.00   3rd Qu.:20.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
 Max.   :20.00   Max.   :36.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
       Q4              Q5              Q6              Q7              Q8       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:1.750   1st Qu.:1.000   1st Qu.:2.500   1st Qu.:1.000  
 Median :3.000   Median :4.000   Median :2.000   Median :4.000   Median :2.000  
 Mean   :2.763   Mean   :2.945   Mean   :2.448   Mean   :3.434   Mean   :2.396  
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
       Q9             Q10              segment   
 Min.   :1.000   Min.   :1.00   Conspicuous:200  
 1st Qu.:2.000   1st Qu.:1.00   Price      :250  
 Median :4.000   Median :2.00   Quality    :200  
 Mean   :3.085   Mean   :2.32   Style      :350  
 3rd Qu.:4.000   3rd Qu.:3.00                    
 Max.   :5.000   Max.   :5.00                    
> library(caret)
> standard1 = preProcess(sim.dat1, method = c('center', 'scale'))  # subtract each variable's mean, then divide by its standard deviation
> head(predict(standard1, sim.dat1))
        age gender      income house   store_exp online_exp store_trans online_trans        Q1
1 1.2989557 Female  0.24179440   Yes  0.93607103  -1.049355  -0.9064934    -1.451057 0.6199408
2 1.7219764 Female  0.26467429   Yes  0.62930797  -1.161404  -0.3653033    -1.451057 0.6199408
3 1.4399626   Male  0.09372058   Yes  0.70613556  -1.063370   0.4464818    -1.451057 1.3095300
4 1.5104660   Male  0.08088761   Yes -0.15185121  -1.142839   1.2582670    -1.451057 1.3095300
5 0.8759349   Male  0.31382955   Yes  0.03904516  -1.159840  -0.3653033    -1.199705 0.6199408
6 1.4399626   Male -0.04952922   Yes -0.20881124  -1.111638  -0.3653033    -1.074028 0.6199408
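
To make the "subtract the mean, divide by the standard deviation" step concrete, here is a minimal base-R sketch. The `toy` data frame and the `standardize` helper are made up for illustration; for numeric columns the result should correspond to what the centering and scaling above does, though that equivalence is an assumption rather than something checked against caret here.

```r
# Toy data (hypothetical values, not taken from sim.dat1)
toy <- data.frame(age = c(20, 30, 40, 50),
                  income = c(40000, 60000, 90000, 130000))

# Hypothetical helper: (x - mean) / standard deviation
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Apply it column by column
manual <- as.data.frame(lapply(toy, standardize))

# Base R's scale() centers and scales each column the same way
builtin <- as.data.frame(scale(toy))

print(manual)
print(all.equal(manual, builtin, check.attributes = FALSE))
```

After the transformation every column has mean 0 and standard deviation 1, which puts variables measured in very different units (years vs. currency) on a comparable scale.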

 

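Log transformation:

apply(sim.dat1, 2, log)
# Syntax: apply(data, margin (1 = rows, 2 = columns), function to apply; a user-defined function also works)
# The call above fails because columns such as gender are categorical, so select the numeric columns before taking logs.
# MARGIN = 2 keeps one column per variable; with MARGIN = 1 the result comes back transposed.
> apply(sim.dat1[, c(1, 3, 5)], 2, log)

apply can operate over rows or over columns. lapply takes no margin argument; on a data frame it always works column by column.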

> head(data.frame(lapply(sim.dat1[, c(1, 3, 5)], log)))
       age   income store_exp
1 4.043051 11.70324  6.271242
2 4.143135 11.71184  6.169623
3 4.077537 11.64573  6.196059
4 4.094345 11.64058  5.851653
5 3.931826 11.73007  5.939186
6 4.077537 11.58675  5.823979
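
As a small illustration of the difference between apply and lapply for a log transform, the sketch below uses a made-up `toy` data frame and picks the numeric columns with `is.numeric` instead of hard-coded positions. Selecting columns this way is an assumption of the sketch, not something the original code does.

```r
# Toy data with one categorical column (hypothetical values)
toy <- data.frame(age = c(20, 35, 50),
                  gender = factor(c("F", "M", "F")),
                  income = c(40000, 60000, 90000))

# Logical mask: TRUE for the numeric columns (age, income)
num_cols <- sapply(toy, is.numeric)

# lapply works column by column and keeps the data-frame layout
logged <- toy
logged[num_cols] <- lapply(toy[num_cols], log)
print(logged)

# apply converts to a matrix first; MARGIN = 2 returns rows x variables
m <- apply(toy[num_cols], 2, log)
print(dim(m))

# MARGIN = 1 returns the same numbers transposed (variables x rows)
print(dim(apply(toy[num_cols], 1, log)))
```

The categorical column is left untouched, so the logged data frame can be used in place of the original one.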

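Quantile check: depending on the business logic, values above or below a chosen quantile can be treated as outliers and handled accordingly.

> quantile(sim.dat1$income, 0.005, na.rm = TRUE)
    0.5% 
51047.79 
> quantile(sim.dat1$income, 0.999, na.rm = TRUE)
   99.9% 
317478.4 
# Replace non-missing incomes below the 0.5% quantile with the 0.5% quantile value
> sim.dat1$income[sim.dat1$income < quantile(sim.dat1$income, 0.005, na.rm = TRUE) & !is.na(sim.dat1$income)] <- 51047.79
# Replace non-missing incomes above the 99.9% quantile with the 99.9% quantile value
> sim.dat1$income[sim.dat1$income > quantile(sim.dat1$income, 0.999, na.rm = TRUE) & !is.na(sim.dat1$income)] <- 317478.4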
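
The capping above hard-codes the two cutoff values. A possible variation, sketched here with a hypothetical `winsorize` helper and made-up numbers, computes both cutoffs from the data once and then caps the non-missing values. The 0.1 and 0.9 cutoffs in the usage line are chosen only so that the effect is visible on a tiny vector.

```r
# Hypothetical helper: cap non-missing values at the given quantiles
winsorize <- function(x, lower = 0.005, upper = 0.999) {
  # Compute both cutoffs up front so the second step does not see capped data
  cuts <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x < cuts[1]] <- cuts[1]
  x[!is.na(x) & x > cuts[2]] <- cuts[2]
  x
}

# Made-up income vector with one very low value, one very high value and an NA
income <- c(1, 50, 60, 70, 80, NA, 10000)
capped <- winsorize(income, lower = 0.1, upper = 0.9)
print(capped)
```

Missing values stay missing, and values between the two cutoffs are returned unchanged.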

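2. Collinearity detection

> library(corrplot)
> # drop the categorical columns (gender, house, segment) before computing correlations
> corrplot.mixed(cor(sim.dat1[, -c(2, 4, 19)]), order = 'hclust', upper = 'square')

Finding highly correlated columns: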

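> num.dat <- sim.dat1[, -c(2, 4, 19)]  # numeric columns only
> names(num.dat)[findCorrelation(cor(num.dat), cutoff = 0.8)]  # columns suggested for removal so that no remaining pair correlates above 0.8
[1] "Q5"           "age"          "Q7"           "Q10"          "online_trans"
[6] "store_exp"    "Q2"
# findCorrelation returns positions within the matrix passed to cor(), so the names must come from num.dat.
# Indexing names(sim.dat1) with those positions, as first posted, labels them "Q3", "age", "Q5", "Q8", "online_exp", "income", "online_trans".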
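
To see which pairs are actually driving a result like this, one option is to list the pairs whose absolute correlation exceeds the cutoff. The sketch below uses simulated data and base R only. It lists pairs and does not decide which member of a pair to drop, so it is a complement to findCorrelation, not a reimplementation of it.

```r
# Simulated data: b is a noisy copy of a, c is independent (hypothetical example)
set.seed(1)
x <- rnorm(100)
toy <- data.frame(a = x,
                  b = x + rnorm(100, sd = 0.1),
                  c = rnorm(100))

cm <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                    var2 = colnames(cm)[high[, "col"]],
                    r = cm[high])
print(pairs)
```

Because the names are taken from the same matrix that was passed to `which`, positions and labels cannot drift apart.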

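3. Sparse variables: drop them

Starting from the original data, build a sparse variable whose values are all 0 and append it to the data frame:

> zero1 <- rep(0, nrow(sim.dat1))
> sim.dat1 <- cbind(sim.dat1, zero1)
> summary(sim.dat1)
The summary now shows one additional column, zero1.

> nearZeroVar(sim.dat1, freqCut = 95/5, uniqueCut = 10)
[1] 20

> sim.dat1 <- sim.dat1[, -nearZeroVar(sim.dat1, freqCut = 95/5, uniqueCut = 2)]  # drop column 20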
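nearZeroVar(x, freqCut, uniqueCut)
  • x: a numeric vector, matrix, or data frame
  • freqCut: cutoff for the ratio of the most common value's frequency to the second most common value's frequency (95/5 in the example above)
  • uniqueCut: cutoff for the number of distinct values as a percentage of the total number of samples; a variable whose percentage is above this cutoff is not removed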
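
The two cutoffs work together, which the following base-R sketch tries to make visible. The `is_near_zero_var` helper is hypothetical and only approximates the rule described by the parameters above (frequency ratio above freqCut and percentage of distinct values at or below uniqueCut, with constant columns always flagged). It is not caret's implementation.

```r
# Hypothetical helper approximating the rule described above
is_near_zero_var <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  x <- x[!is.na(x)]
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as % of samples
  freq_ratio > freqCut && pct_unique <= uniqueCut
}

# Made-up columns: all zeros, almost all zeros, and a balanced 1-5 rating
toy <- data.frame(constant = rep(0, 100),
                  rare = c(rep(0, 98), 1, 1),
                  likert = rep(1:5, 20))

flags <- sapply(toy, is_near_zero_var)
print(flags)
```

The balanced rating column has few distinct values but a frequency ratio of 1, so it is kept. That is consistent with survey columns such as Q1 to Q10 not being flagged in the run above.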

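Nominal variables: categories such as A, B, C, D cannot be used in arithmetic directly, so they are converted to 0/1 dummy variables for use in later computations.

Single dummy variables

# SegData is the data frame used in this part; it is not created earlier in this post.
# levelsOnly is an argument of dummyVars, not of predict. With levelsOnly = FALSE the new column names are the original variable name plus the factor level.
> head(predict(dummyVars(~., data = SegData, levelsOnly = FALSE), SegData))

Interaction dummy variables

> head(predict(dummyVars(~gender+house+income+income:gender,
                         data = SegData,
                         levelsOnly = FALSE),SegData))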
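
What the dummy and interaction columns contain can be shown by hand in base R. The `one_hot` helper and the `toy` data below are hypothetical. The `variable.level` naming imitates the levelsOnly = FALSE style described above, and the `:income` suffix for the interaction columns is just a label chosen for this sketch.

```r
# Toy data (hypothetical values)
toy <- data.frame(gender = factor(c("Female", "Male", "Female")),
                  income = c(100, 200, 300))

# Hypothetical helper: one 0/1 column per factor level
one_hot <- function(f, prefix) {
  f <- as.factor(f)
  m <- sapply(levels(f), function(lv) as.integer(f == lv))
  colnames(m) <- paste(prefix, levels(f), sep = ".")
  m
}

g <- one_hot(toy$gender, "gender")

# Interaction: income multiplied by each gender indicator
inter <- g * toy$income
colnames(inter) <- paste0(colnames(g), ":income")

result <- cbind(g, income = toy$income, inter)
print(result)
```

Each row has exactly one indicator set to 1, and the interaction columns carry the income value only in the column of the matching level.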

Saving and loading RData

> save.image('data_preprocessing.RData')
> load('data_preprocessing.RData')
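
save.image stores every object in the workspace. When only one processed data frame needs to be kept, saveRDS and readRDS are a possible alternative. The sketch below writes to a temporary file, and the `toy` object is made up for illustration.

```r
# Toy object standing in for a preprocessed data frame (hypothetical)
toy <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

# Write a single object to disk, then read it back under a new name
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

print(all.equal(toy, restored))
```

Unlike load, readRDS returns the object instead of restoring it under its old name, so it cannot silently overwrite something in the current session.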

 

posted @ 2017-07-13 13:05  积水成渊数据分析