The apply family of functions and parallel computing in R
I. The apply family
1. apply: applies to matrices and arrays
# apply
# 1 means rows, 2 means columns
# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
# [1]  6  7  8  9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
# [1]  5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
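The heading mentions arrays, while the example above only uses a matrix. As a minimal sketch (the array `a` and the result names are made up for illustration), the same MARGIN argument extends to a 3-D array:

```r
# Hypothetical 2 x 3 x 4 array, filled in column-major order with 1..24
a <- array(1:24, dim = c(2, 3, 4))

# MARGIN = 3: sum each 2 x 3 slice along the third dimension
slice_sums <- apply(a, 3, sum)
slice_sums

# MARGIN = c(1, 2): sum across the third dimension for every (row, column) cell
cell_sums <- apply(a, c(1, 2), sum)
dim(cell_sums)
```

The result has one entry per combination of the dimensions named in MARGIN, so `slice_sums` has 4 values and `cell_sums` is a 2 x 3 matrix.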
2. eapply: applies to the variables in an environment
# a new environment
e <- new.env()
# two environment variables, a and b
e$a <- 1:10
e$b <- 11:20
# mean of the variables
eapply(e, mean)
# $b
# [1] 15.5
#
# $a
# [1] 5.5
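Note that `$b` is printed before `$a` in the output above: eapply does not promise any particular order. If a stable order matters, one possible workaround (a sketch, not the only option) is to list the names with ls(), which sorts them by default, and fetch the values with mget():

```r
e <- new.env()
e$a <- 1:10
e$b <- 11:20

# ls() returns the variable names sorted alphabetically;
# mget() fetches them from e as a named list in that order
ordered_means <- sapply(mget(ls(e), envir = e), mean)
ordered_means
```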
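3. lapply: applies to a list and returns a list. A data.frame is in fact also a list, one made of several vectors of equal length bound together as columns, so lapply processes it column by column: lapply(list, function)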
sapply(iris[,1:4],mean)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#     5.843333     3.057333     3.758000     1.199333
lapply(iris[,1:4],mean)
# $Sepal.Length
# [1] 5.843333
#
# $Sepal.Width
# [1] 3.057333
#
# $Petal.Length
# [1] 3.758
#
# $Petal.Width
# [1] 1.199333
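A quick check of the claim that a data.frame is a list of equal-length columns. This sketch only uses the built-in iris data set; the names `col_lengths` and `col_classes` are introduced here for illustration:

```r
is.list(iris)                # a data.frame is stored as a list
length(iris) == ncol(iris)   # one list element per column

# lapply visits the columns one by one
col_lengths <- lapply(iris, length)  # every column has nrow(iris) entries
col_classes <- lapply(iris, class)
str(col_classes)
```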
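4. sapply: a user-friendly version of lapply. Both lapply and sapply can be applied to a list or a data.frame; only the type of the returned object differs. lapply always returns a list, while sapply's result depends on the situation: if every element of the result has the same length, the result is simplified to a vector or a matrix. The examples below illustrate this.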
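# The two calls below return exactly the same thing, a list in both cases
sapply(iris,unique)
lapply(iris,unique)
# Of the next two, the first returns a vector and the second a list
sapply(iris[,1:4],mean)
lapply(iris[,1:4],mean)
# Of the next two, the first returns a matrix and the second a list
sapply(iris[,1:4], function(x) x/2)
lapply(iris[,1:4], function(x) x/2)
# sapply picks the most suitable object type for the results, whereas lapply always returns a list
# The two calls below give the same result
library(magrittr)
lapply(iris[,1:4],mean)%>%unlist()
sapply(iris[,1:4],mean)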
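To make the simplification rule concrete, the sketch below inspects what sapply returns in the three cases discussed above. The helper name `describe` is made up for this illustration.

```r
# Hypothetical helper: report whether a result is a list, a matrix, or a plain vector
describe <- function(x) {
  if (is.list(x)) "list" else if (is.matrix(x)) "matrix" else "vector"
}

describe(sapply(iris, unique))                    # results differ in length
describe(sapply(iris[, 1:4], mean))               # one number per column
describe(sapply(iris[, 1:4], function(x) x / 2))  # 150 numbers per column
describe(lapply(iris[, 1:4], mean))               # lapply never simplifies
```

When the shape of the result must be predictable, `simplify = FALSE` makes sapply behave like lapply.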
5. vapply: requires a third argument (FUN.VALUE), a template describing the type and length of each result
l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
# [1] "matrix"
# (R >= 4.0.0 prints "matrix" "array" here)
# let's see it
l.fivenum
#            a    b
# Min.     1.0 11.0
# 1st Qu.  3.0 13.0
# Median   5.5 15.5
# 3rd Qu.  8.0 18.0
# Max.    10.0 20.0
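The practical benefit of FUN.VALUE is that a mismatch fails loudly instead of silently changing the shape of the result. A minimal sketch, reusing the list `l` from above (the tryCatch wrapper and the names `msg` and `ok` are only there for illustration, so the script keeps running):

```r
l <- list(a = 1:10, b = 11:20)

# range() returns 2 values, but the template promises 1: vapply stops with an error
msg <- tryCatch(
  vapply(l, range, numeric(1)),
  error = function(err) conditionMessage(err)
)
msg

# With a matching template the call succeeds and returns a 2 x 2 matrix
ok <- vapply(l, range, numeric(2))
dim(ok)
```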
6. replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”
replicate(10, rnorm(10))
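As a small sketch of what replicate returns (the seed, sizes, and variable names are arbitrary choices for illustration): each evaluation becomes one column when the expression yields a vector, and one element when it yields a single number.

```r
set.seed(42)  # arbitrary seed, only to make the run reproducible

# Each call to rnorm(10) becomes one column of a 10 x 10 matrix
m <- replicate(10, rnorm(10))
dim(m)

# A scalar expression gives a plain numeric vector, here 5 sample means
sample_means <- replicate(5, mean(rnorm(100)))
length(sample_means)

# simplify = FALSE keeps the results as a list instead
as_list <- replicate(3, rnorm(2), simplify = FALSE)
class(as_list)
```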
7. mapply: can pass several arguments to the function.
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.
l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
# sum the corresponding elements of l1 and l2
mapply(sum, l1$a, l1$b, l2$c, l2$d)
# [1]  64  68  72  76  80  84  88  92  96 100
# mapply is like an sapply that can take several arguments
mapply(rep, 1:4, 5)
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    1    2    3    4
# [3,]    1    2    3    4
# [4,]    1    2    3    4
# [5,]    1    2    3    4
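Two further mapply arguments are useful alongside the examples above: MoreArgs passes values that stay fixed across calls, and SIMPLIFY = FALSE keeps the result as a list. A minimal sketch with made-up inputs and result names:

```r
# Element-wise pairing: rep(1, 3), rep(2, 2), rep(3, 1).
# The results differ in length, so mapply returns a list
uneven <- mapply(rep, 1:3, 3:1)

# MoreArgs holds arguments that are the same for every call
fixed <- mapply(rep, x = 1:3, MoreArgs = list(times = 2))
dim(fixed)

# SIMPLIFY = FALSE always returns a list, even when simplification is possible
sums <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
str(sums)
```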
8. rapply
Description: “rapply is a recursive version of lapply.”
# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
#       a1       a2       a3       a4       a5       a6       a7       a8
# 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
#       a9      a10       b1       b2       b3       b4       b5       b6
# 3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
#       b7       b8       b9      b10
# 4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
# $a
#  [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
#  [9] 3.169925 3.321928
#
# $b
#  [1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
#  [9] 4.247928 4.321928
# what if the function is the mean?
rapply(l, mean)
#    a    b
#  5.5 15.5
rapply(l, mean, how = "list")
# $a
# [1] 5.5
#
# $b
# [1] 15.5
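The examples above use a flat list, where rapply behaves much like lapply. The recursion matters for nested lists; the sketch below uses a made-up nested list and the classes argument to touch only the numeric leaves.

```r
# Hypothetical nested list mixing numeric and character leaves
nested <- list(a = c(1, 4, 9), b = list(c = "x", d = c(16, 25)))

# how = "unlist": apply sqrt to numeric leaves, drop the rest, flatten the result
roots <- rapply(nested, sqrt, classes = "numeric", how = "unlist")
roots

# how = "replace": keep the nested structure and leave non-matching leaves untouched
replaced <- rapply(nested, sqrt, classes = "numeric", how = "replace")
str(replaced)
```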
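II. Parallel computing
Below, Project Euler problem 14 is used to demonstrate vectorized programming in R (using the apply family) and parallel computing.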
#-----Longest Collatz sequence, Problem 14
func <- function(x) {
  n <- 1
  raw <- x
  while (x > 1) {
    x <- ifelse(x%%2==0,x/2,3*x+1)
    n <- n + 1
  }
  return(c(raw,n))
}
# Method 1: vectorized programming
library(magrittr)
system.time({
  x <- 1:1e5
  res1 <- sapply(x, func)%>%t()
})
#    user  system elapsed
#  37.960   0.360  41.315
# Method 2: vectorized programming
system.time({
  x <- 1:1e5
  res2 <- do.call('rbind',lapply(x,func))
})
#    user  system elapsed
#  36.031   0.181  36.769
# Method 3: parallel computing
library(parallel)
# use system.time to report how long the computation takes
system.time({
  x <- 1:1e5
  cl <- makeCluster(4) # start a cluster of four workers
  results <- parLapply(cl,x,func) # parallel version of lapply
  res.df <- do.call('rbind',results) # combine the results
  stopCluster(cl) # shut down the cluster
})
#    user  system elapsed
#   0.199   0.064  20.038
# Method 4: for loop
system.time({
  m <- matrix(nrow = 0,ncol = 2)
  for(i in 1:1e5){
    m <- rbind(m,func(i))
  }
})
# Method 4 takes too long
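A variation on Method 3, offered as a sketch rather than a benchmark: it derives the worker count from detectCores() instead of hard-coding 4, uses parSapply, and checks the parallel result against the sequential one on a smaller range. The names `collatz_len` and `n_workers` are introduced here for illustration and are not part of the original code.

```r
library(parallel)

# Same counting logic as func above, with a plain if/else instead of ifelse()
collatz_len <- function(x) {
  n <- 1
  while (x > 1) {
    x <- if (x %% 2 == 0) x / 2 else 3 * x + 1
    n <- n + 1
  }
  n
}

x <- 1:10000  # smaller range than the original, so the sketch runs quickly

# Use at most 2 workers; detectCores() can return NA on some platforms
n_workers <- max(1, min(2, detectCores(), na.rm = TRUE))
cl <- makeCluster(n_workers)
par_res <- parSapply(cl, x, collatz_len)  # parallel counterpart of sapply
stopCluster(cl)

# The sequential run should give exactly the same chain lengths
seq_res <- sapply(x, collatz_len)
identical(par_res, seq_res)

# Starting number with the longest chain in this range
x[which.max(seq_res)]
```

Starting a cluster has a fixed cost, so for small inputs the sequential version can be faster; the parallel version pays off as the range grows.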
That's all.
References: