代码3-1 餐饮销量额数据缺失值及异常值检测代码
1 # 设置工作空间 2 # 把“数据及程序”文件夹拷贝到F盘下,再用setwd设置工作空间 3 setwd("D:/R_Project/book1_R/chapter3/示例程序") 4 # 读入数据 5 saledata <- read.csv(file = "./data/catering_sale.csv", header = TRUE) 6 7 # 缺失值检测 并打印结果,由于R把TRUE和FALSE分别当作1、0,可以用sum()和mean()函数来分别获取缺失样本数、缺失比例 8 sum(complete.cases(saledata)) 9 sum(!complete.cases(saledata)) 10 mean(!complete.cases(saledata)) 11 saledata[!complete.cases(saledata), ] 12 13 # 异常值检测箱线图 14 # boxwex :a scale factor to be applied to all boxes.也就是调整箱子的比例 15 sp <- boxplot(saledata$"销量", boxwex = 0.7) 16 #给箱线图添加标题 17 title("销量异常值检测箱线图") 18 19 xi <- 1.1 20 21 #计算销量的标准差 22 sd.s <- sd(saledata[complete.cases(saledata), ]$"销量") 23 24 #计算销量的平均值 25 mn.s <- mean(saledata[complete.cases(saledata), ]$"销量") 26 points(xi, mn.s, col = "red", pch = 18) 27 arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", angle = 75, length = .1) 28 text(rep(c(1.05, 1.05, 0.95, 0.95), length = length(sp$out)), 29 labels = sp$out[order(sp$out)], sp$out[order(sp$out)] + 30 rep(c(150, -150, 150, -150), length = length(sp$out)), col = "red")
Notes:
(1)
(2)
* R语言用complete.cases 和 na.omit去除有空值的行:http://blog.sina.com.cn/s/blog_59990a450101qnvy.html
* complete.cases()函数:Return a logical vector indicating which cases are complete, i.e., have no missing values.
* 也就是说它返回的是一个TRUE/FALSE的逻辑向量
(3)
* points()函数:用于标记某个点,设定参数pch即标记该点要用什么形状的来标记,pch=20是实心圆形状;设定参数cex表示这个形状的大小设定为多少,一般cex=2就够了;参数col设定点的颜色
* pch
* plotting ‘character’, i.e., symbol to use. This can either be a single character or an integer code for one of a set of graphics symbols.The full set of S symbols is available with pch = 0:18, see the examples below. (NB: R uses circles instead of the octagons used in S.) Value pch = "." (equivalently pch = 46) is handled specially. It is a rectangle of side 0.01 inch (scaled by cex). In addition, if cex = 1 (the default), each side is at least one pixel (1/72 inch on the pdf, postscript and xfig devices). For other text symbols, cex = 1 corresponds to the default fontsize of the device, often specified by an argument pointsize. For pch in 0:25 the default size is about 75% of the character height (see par("cin")).
* cex
* character (or symbol) expansion: a numerical vector. This works as a multiple of par("cex").
(4)
* arrows()函数:用于在图像画箭头的函数
* 定义:arrows(x0, y0, x1 = x0, y1 = y0, length = 0.25, angle = 30,code = 2, col = par("fg"), lty = par("lty"),lwd = par("lwd"), ...)
* Argument:
* x0, y0
* coordinates of points from which to draw.
* x1, y1
* coordinates of points to which to draw. At least one must the supplied
* length
* length of the edges of the arrow head (in inches).
* angle
* angle from the shaft of the arrow to the edge of the arrow head.(从箭头的轴到箭头的边缘的角度。)
* code
* integer code, determining kind of arrows to be drawn.
* col, lty, lwd
* graphical parameters, possible vectors. NA values in col cause the arrow to be omitted.
箱线图分析:
代码3-2 餐饮销量额数据缺失值及异常值检测代码
1 #设置工作空间 2 #把"数据及程序"文件夹复制到D盘下,再用setwd设置工作空间 3 setwd("D:/R_Project/Practice_book1/") 4 5 #读入数据 6 saledata = read.table(file = "./chapter03/catering_sale.csv",sep = ",",header = T) 7 8 sales = saledata[,2] 9 10 #统计分析 11 #参数na.rm设置为TRUE,表示操作数据时,遇到NA不管 12 13 #均值 14 mean_ <- mean(sales,na.rm = T) 15 16 #中位数 17 median_ <- median(sales,na.rm = T) 18 19 #极差 20 range_ <- max(sales,na.rm = T)-min(sales,na.rm = T) 21 22 #标准差 23 std_ <- sqrt(var(sales,na.rm = T)) 24 25 #变异系数 26 variation_ <- std_/mean_ 27 28 #四分位数间距 29 q1 <- quantile(sales,0.25,na.rm = T) 30 q3 <- quantile(sales,0.75,na.rm = T) 31 distance <- q3-q1 32 33 #将这些数据特征整合在一个矩阵中 34 #参数byrow 在matrix()函数中默认设置为FALSE即按列设置矩阵;byrow = TRUE表示 35 a <- matrix(c(mean_,median_,range_,std_,variation_,q1,q3,distance),1,byrow = T) 36 #给矩阵设置指定的列名 37 colnames(a) <- c("均值","中位数","极差","标准差","变异系数","1/4分位数","3/4分位数","四分位数间距") 38 39 print(a)
代码3-3 菜品盈利帕累图代码
1 setwd("D:/R_Project/Practice_book1/chapter03/") 2 3 #读取菜品数据,绘制帕累托图 4 5 dishdata <- read.csv(file = "./catering_dish_profit.csv") 6 barplot(dishdata[, 3], col = "blue1", names.arg = dishdata[, 2], width = 1, 7 space = 0, ylim = c(0, 10000), xlab = "菜品", ylab = "盈利:元") 8 accratio <- dishdata[, 3] 9 for ( i in 1:length(accratio)) { 10 accratio[i] <- sum(dishdata[1:i, 3]) / sum(dishdata[, 3]) 11 } 12 13 par(new = FALSE, mar = c(4, 4, 4, 4)) 14 points(accratio * 10000 ~ c((1:length(accratio) - 0.5)), type = "b") 15 axis(4, col = "red", col.axis = "red", at = 0:10000, label = c(0:10000 / 10000)) 16 mtext("累积百分比", 4, 2) 17 18 points(6.5, accratio[7] * 10000, col="red") 19 text(7.3, accratio[7] * 10000-200,paste(round(accratio[7] + 0.00001, 4) * 100, "%"))
代码3-4 餐饮销量数据相关性分析
1 # 餐饮销量数据相关性分析 2 # 设置工作空间 3 # 把“数据及程序”文件夹拷贝到F盘下,再用setwd设置工作空间 4 setwd("D:/R_Project/book1_R/chapter3/示例程序") 5 # 读取数据 6 cordata <- read.csv(file = "./data/catering_sale_all.csv", header = TRUE) 7 # 求出相关系数矩阵 8 cor(cordata[, 2:11])
Notes:
|r|<=0.3 为极弱线性相关或不存在线性相关
0.3<|r|<=0.5 为低度线性相关
0.5<|r|<=0.8为显著线性相关
|r|>0.8 为高度线性相关