R Data Preprocessing (Part 2)
1. Data Transformation
Centering and scaling the original data:
> summary(sim.dat1)
      age          gender         income        house      store_exp       online_exp
 Min.   :16.00   Female:554   Min.   : 41776   No :432   Min.   :155.8   Min.   :  68.82
 1st Qu.:25.00   Male  :446   1st Qu.: 87896   Yes:568   1st Qu.:205.1   1st Qu.: 420.34
 Median :36.00                Median : 93869             Median :329.8   Median :1941.86
 Mean   :38.58                Mean   :109923             Mean   :373.1   Mean   :2120.18
 3rd Qu.:53.00                3rd Qu.:119456             3rd Qu.:597.2   3rd Qu.:2440.78
 Max.   :69.00                Max.   :319704             Max.   :597.3   Max.   :9479.44
  store_trans    online_trans         Q1              Q2              Q3
 Min.   : 1.00   Min.   : 1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.: 3.00   1st Qu.: 6.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000
 Median : 4.00   Median :14.00   Median :3.000   Median :1.000   Median :1.000
 Mean   : 5.35   Mean   :13.55   Mean   :3.101   Mean   :1.823   Mean   :1.992
 3rd Qu.: 7.00   3rd Qu.:20.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000
 Max.   :20.00   Max.   :36.00   Max.   :5.000   Max.   :5.000   Max.   :5.000
      Q4              Q5              Q6              Q7              Q8
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.:2.000   1st Qu.:1.750   1st Qu.:1.000   1st Qu.:2.500   1st Qu.:1.000
 Median :3.000   Median :4.000   Median :2.000   Median :4.000   Median :2.000
 Mean   :2.763   Mean   :2.945   Mean   :2.448   Mean   :3.434   Mean   :2.396
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000
      Q9             Q10            segment
 Min.   :1.000   Min.   :1.00   Conspicuous:200
 1st Qu.:2.000   1st Qu.:1.00   Price      :250
 Median :4.000   Median :2.00   Quality    :200
 Mean   :3.085   Mean   :2.32   Style      :350
 3rd Qu.:4.000   3rd Qu.:3.00
 Max.   :5.000   Max.   :5.00
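> library(caret)  # provides preProcess()
> standard1 <- preProcess(sim.dat1, method = c('center', 'scale'))  # subtract each variable's mean, then divide by its standard deviation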
> head(predict(standard1, sim.dat1))
        age gender      income house   store_exp online_exp store_trans online_trans        Q1
1 1.2989557 Female  0.24179440   Yes  0.93607103  -1.049355  -0.9064934    -1.451057 0.6199408
2 1.7219764 Female  0.26467429   Yes  0.62930797  -1.161404  -0.3653033    -1.451057 0.6199408
3 1.4399626   Male  0.09372058   Yes  0.70613556  -1.063370   0.4464818    -1.451057 1.3095300
4 1.5104660   Male  0.08088761   Yes -0.15185121  -1.142839   1.2582670    -1.451057 1.3095300
5 0.8759349   Male  0.31382955   Yes  0.03904516  -1.159840  -0.3653033    -1.199705 0.6199408
6 1.4399626   Male -0.04952922   Yes -0.20881124  -1.111638  -0.3653033    -1.074028 0.6199408
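To make the centering and scaling step concrete, here is a minimal base-R sketch on a made-up data frame. The `toy` data and its values are assumptions for illustration, not part of `sim.dat1`. It shows that subtracting the mean and dividing by the standard deviation by hand gives the same numbers as base R's `scale()`.

```r
# Toy data standing in for two numeric columns (values are invented)
toy <- data.frame(age = c(23, 35, 47, 52, 61),
                  income = c(48000, 62000, 91000, 120000, 75000))

# Center and scale by hand: subtract the column mean, divide by the column sd
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same computation in one call (it returns a matrix)
scaled <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

print(round(manual, 4))
print(colMeans(manual))      # each column mean is numerically zero
print(sapply(manual, sd))    # each column sd is 1
```

Unlike this sketch, `preProcess()` stores the training means and standard deviations so that the same transformation can later be applied to new data with `predict()`.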
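Log transformation:
> apply(sim.dat1, 1, log)  # fails: the factor columns turn the whole matrix into character data
# Syntax: apply(data, margin, function), where margin 1 means rows and 2 means columns; the function can be built in or user defined
> apply(sim.dat1[, c(1, 3, 5)], 2, log)  # gender is categorical, so select the numeric columns first; margin 2 keeps the row-by-column layout (margin 1 returns the transpose)
apply() can operate over rows or over columns. lapply() takes no margin argument: on a data frame it iterates over the columns and returns a list.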
> head(data.frame(lapply(sim.dat1[, c(1, 3, 5)], log)))
       age   income store_exp
1 4.043051 11.70324  6.271242
2 4.143135 11.71184  6.169623
3 4.077537 11.64573  6.196059
4 4.094345 11.64058  5.851653
5 3.931826 11.73007  5.939186
6 4.077537 11.58675  5.823979
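The difference between the two margins of `apply()` and the column-wise behavior of `lapply()` is easiest to see on a tiny example. The sketch below uses an invented two-row data frame (an assumption for illustration) and compares the shapes of the three results.

```r
# Two rows, three numeric columns (values are invented)
toy <- data.frame(age = c(20, 40),
                  income = c(50000, 80000),
                  store_exp = c(200, 400))

by_row <- apply(toy, 1, log)                 # one output column per input row
by_col <- apply(toy, 2, log)                 # same layout as the input
via_lapply <- data.frame(lapply(toy, log))   # list of columns rebuilt as a data frame

print(dim(by_row))
print(dim(by_col))
print(via_lapply)
```

Because `log` works elementwise, the values are identical in all three results. Only the orientation differs, which is why the `lapply()` form is convenient when the output should stay a data frame.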
Quantile check: depending on the business logic, values above or below a chosen quantile can be flagged as outliers and handled.
> quantile(sim.dat1$income, 0.005, na.rm = TRUE)
    0.5%
51047.79
> quantile(sim.dat1$income, 0.999, na.rm = TRUE)
   99.9%
317478.4
# Set non-missing incomes below the 0.5% quantile to the 0.5% quantile value
> sim.dat1$income[sim.dat1$income < quantile(sim.dat1$income, 0.005, na.rm = TRUE) & !is.na(sim.dat1$income)] <- 51047.79
# Set non-missing incomes above the 99.9% quantile to the 99.9% quantile value
> sim.dat1$income[sim.dat1$income > quantile(sim.dat1$income, 0.999, na.rm = TRUE) & !is.na(sim.dat1$income)] <- 317478.4
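The capping step can also be written so that the bounds are computed once rather than typed in by hand. The following is a minimal sketch under stated assumptions: `cap_quantiles` is a hypothetical helper name, and the `income` vector is simulated rather than taken from `sim.dat1`.

```r
# Hypothetical helper: clamp non-missing values to the given quantile bounds
cap_quantiles <- function(x, lower = 0.005, upper = 0.999) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x < bounds[1]] <- bounds[1]   # raise values below the lower bound
  x[!is.na(x) & x > bounds[2]] <- bounds[2]   # lower values above the upper bound
  x                                           # missing values are left untouched
}

set.seed(1)
# Simulated incomes plus one missing value and two extreme values
income <- c(rlnorm(200, meanlog = 11, sdlog = 0.5), NA, 5, 5e6)
capped <- cap_quantiles(income)

print(range(income, na.rm = TRUE))
print(range(capped, na.rm = TRUE))
```

Computing both bounds before modifying the vector matters: if the lower tail is capped first and the quantile is then recomputed, the upper bound is estimated from already altered data.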
2. Collinearity Detection
> library(corrplot)
> corrplot.mixed(cor(sim.dat1[, -c(2, 4, 19)]), order = 'hclust', upper = 'square')  # drop the categorical columns first
Find highly correlated columns:
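# findCorrelation() returns column positions within the correlation matrix, so look the names up in the same numeric subset
> num.dat <- sim.dat1[, -c(2, 4, 19)]
> names(num.dat)[findCorrelation(cor(num.dat), cutoff = 0.8)]  # columns to drop so that no remaining pair has an absolute correlation above 0.8
[1] "Q5"           "age"          "Q7"           "Q10"          "online_trans"
[6] "store_exp"    "Q2"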
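To see which pairs drive a result like this, it can help to list the highly correlated pairs directly. The sketch below is base R only and runs on simulated data (the `toy` columns are assumptions). It is not the algorithm used by `findCorrelation()`, which additionally decides which member of each pair to drop; it only lists the pairs above the cutoff.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1
                  x3 = rnorm(n))                 # independent noise

cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA             # keep each pair only once

# Positions in the lower triangle whose absolute correlation exceeds 0.8
high <- which(abs(cm) > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[high[, "row"]],
                         var2 = colnames(cm)[high[, "col"]],
                         r = cm[high])
print(high_pairs)
```

Looking names up through `rownames(cm)` and `colnames(cm)` keeps the indices and the names tied to the same matrix, which avoids mislabeling columns when some were excluded before computing the correlations.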
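3. Sparse Variables: Remove Them Directly
Starting from the original data, construct a sparse variable whose values are all 0 and append it to the data frame:
> zero1 <- rep(0, nrow(sim.dat1))
> sim.dat1 <- cbind(sim.dat1, zero1)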
> summary(sim.dat1)
The summary now includes one extra column, zero1.
> nearZeroVar(sim.dat1, freqCut =95/5, uniqueCut = 10)
[1] 20
> sim.dat1 <- sim.dat1[, -nearZeroVar(sim.dat1, freqCut = 95/5, uniqueCut = 2)]  # drop column 20
nearZeroVar(x,freqCut,uniqueCut)
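- x: a numeric vector, matrix, or data frame
- freqCut: cutoff for the ratio of the frequency of the most common value to the frequency of the second most common value (95/5 in the example above)
- uniqueCut: cutoff for the number of distinct values as a percentage of the total number of samples; a variable whose percentage is above this cutoff is not removed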
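The two quantities behind these cutoffs can be computed by hand, which makes the thresholds easier to interpret. The sketch below is an approximation under stated assumptions: `nzv_metrics` is a hypothetical helper, the three vectors are invented, and the flagging rule (frequency ratio above `freqCut` and percent unique at or below `uniqueCut`, or only one distinct value) is my reading of how `nearZeroVar()` behaves rather than a copy of its source.

```r
# Hypothetical helper: compute the two metrics for a single vector
nzv_metrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) < 2                       # only one distinct value
  freq_ratio <- if (zero_var) NA_real_ else counts[[1]] / counts[[2]]
  pct_unique <- 100 * length(counts) / length(x)
  flagged <- zero_var || (freq_ratio > freqCut && pct_unique <= uniqueCut)
  data.frame(freq_ratio, pct_unique, zero_var, flagged)
}

examples <- list(constant = rep(0, 100),               # a single value throughout
                 rare     = c(rep(0, 98), 1, 1),       # 98 zeros against 2 ones
                 balanced = rep(c(0, 1), 50))          # 50 zeros against 50 ones
res <- do.call(rbind, lapply(examples, nzv_metrics))
print(res)
```

The `rare` vector has a frequency ratio of 98/2 = 49, well above 95/5 = 19, while `balanced` has a ratio of 1, so only the first two vectors are flagged.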
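Nominal variables: categories such as A, B, C, D cannot be used in arithmetic directly, so they are converted to 0/1 dummy variables for use in later computations.
Simple dummy variables
> head(predict(dummyVars(~ ., data = SegData, levelsOnly = FALSE), SegData))  # levelsOnly = FALSE names each dummy variable as the original variable name plus the factor level
Dummy variables with interactions
> head(predict(dummyVars(~ gender + house + income + income:gender, data = SegData, levelsOnly = FALSE), SegData))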
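For readers without caret at hand, base R's `model.matrix()` gives a rough picture of what dummy coding and interaction terms produce. This is a sketch on an invented data frame (the `toy` data is an assumption), and it is a swapped-in technique: unlike `dummyVars()`, `model.matrix()` applies contrasts, so only the first factor gets a column for every level once the intercept is removed.

```r
toy <- data.frame(gender = factor(c("Female", "Male", "Male", "Female")),
                  house  = factor(c("Yes", "No", "Yes", "Yes")),
                  income = c(50000, 80000, 65000, 72000))

# "- 1" removes the intercept so that gender gets one 0/1 column per level
simple <- model.matrix(~ gender + house + income - 1, data = toy)

# Adding income:gender appends a column that is income for one gender, 0 otherwise
inter <- model.matrix(~ gender + income + income:gender - 1, data = toy)

print(colnames(simple))
print(simple)
print(colnames(inter))
```

The interaction column lets a later model fit a different income effect for each gender, which is the purpose of the `income:gender` term in the `dummyVars()` call.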
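Saving and loading .RData files
> save.image('data_preprocessing.RData')  # write every object in the workspace to a file
> load('data_preprocessing.RData')  # restore those objects in a later session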
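When only a few objects need to be kept, saving them individually can be lighter than saving the whole workspace. The sketch below uses an invented `toy` data frame and temporary files (both assumptions for illustration) to contrast `save()`/`load()`, which restore an object under its original name, with `saveRDS()`/`readRDS()`, which let the caller choose the name.

```r
toy <- data.frame(id = 1:3, income = c(50000, 80000, 65000))

# save()/load(): the object comes back under the name it was saved with
rdata_file <- tempfile(fileext = ".RData")
save(toy, file = rdata_file)
rm(toy)
load(rdata_file)

# saveRDS()/readRDS(): a single object, assigned to any name on reading
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, rds_file)
toy_copy <- readRDS(rds_file)

print(identical(toy, toy_copy))
```

Because `load()` silently overwrites existing objects that share a name with those in the file, the `readRDS()` form is often the safer choice in longer scripts.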