R语言6进阶
生信技能树R语言部分学习笔记
1. tidyr包能做什么事情?
1.1 数据清理
tidyr的扁和长
数据的两种组织方式:扁和长,二者包含的信息相同,可以相互转换(用tidyr中的函数:gather-spread)
举例:
扁数据
(1) 扁变长
test_gather <- gather(data = test,
key = sample_nm,
value = exp,
- geneid) ##表示选择除了geneid这一列的列
head(test_gather)
gather()
函数已经被新的函数替代了:df %>% gather("key", "value", x, y, z)
等价于 df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")
pivot_longer(data = test,
names_to = "sample_nm",
values_to = "value",
-geneid)
(2)长变扁
test_re <- spread(data = test_gather,
key = sample_nm,
value = exp)
head(test_re)
同样的该函数spread()
也被别的函数所替代:df %>% spread(key, value)
等价于 df %>% pivot_wider(names_from = key, values_from = value)
pivot_wider(data = test_gather,
names_from = sample_nm,
values_from = exp)
1.2 分割和合并
tidyr的分和合
采用的函数:separate
和unite
原始数据:
test <- data.frame(x = c( "a,b", "a,d", "b,c"));test
分割:
test_seprate <- separate(test,x, c("X", "Y"),sep = ",");test_seprate
合并:
test_re <- unite(test_seprate,"x",X,Y,sep = ",");test_re
1.3 处理NA
原始数据:
(1)去掉含有NA的行,可以选择只根据某一列来去除
drop_na(X) #把所有带有NA的行都删掉,功能与na.omit()相同
drop_na(X,X1) #按照第一列来去除NA的行
drop_na(X,X2)
(2)替换NA
> replace_na(X$X2,0) #将第二列中的NA的值用0替换,只是打印出来没有实际赋值操作
[1] 1 0 3 4 5
(3)用上一行的值填充NA
fill(X,X2)
2. dplyr包的功能
基础:
dplyr包是专门用来处理数据框。
原始数据:
2.1 mutate()
新增列
mutate(test, new = Sepal.Length * Sepal.Width)
2.2 select()
按列取子集
(1)按列号筛选
select(test,1) #选出原数据的第一列
select(test,c(1,5)) #选出原数据的第一、五列
(2)按列名筛选
select(test,Sepal.Length)
select(test, Petal.Length, Petal.Width)
vars <- c("Petal.Length", "Petal.Width") #与上面的代码等价
select(test, one_of(vars))
(3)一组来自tidyselect的有用函数
select(test, starts_with("Petal")) #选择以Petal为前缀的列
select(test, ends_with("Width")) #选择以Width为后缀的列
select(test, contains("etal")) #选择包含etal的列
select(test, matches(".t.")) #选择包含.t.的列
select(test, everything()) #选择所有的变量
select(test, last_col()) #选择最后一个变量
select(test, last_col(offset = 1)) #选择倒数第一个变量
(4)利用everything()
,列名可以重排序
select(test,Species,everything())
2.3 filter()
按行取子集
filter(test, Species == "setosa")
filter(test, Species == "setosa"&Sepal.Length > 5 )
filter(test, Species %in% c("setosa","versicolor"))
2.4 arrange()
按照数据框的某一列来给整个数据框进行排序
arrange(test, Sepal.Length)#默认从小到大排序
arrange(test, desc(Sepal.Length))#用desc从大到小
arrange(test, desc(Sepal.Width),Sepal.Length)
2.5 summarise()
汇总, 通常和分组一起使用
library(dplyr)
summarise(iris, k = mean(Sepal.Length)) #汇总总的数据
summarise(group_by(iris, Species),
k = mean(Sepal.Length)) #分组汇总
用管道符号表示为:
iris %>% group_by(Species) %>% summarise(k = mean(Sepal.Length))
对数据进行汇总操作,结合group_by使用实用性强
summarise(test, mean(Sepal.Length), sd(Sepal.Length))# 计算Sepal.Length的平均值和标准差
先按照Species分组,计算每组Sepal.Length的平均值和标准差
group_by(test, Species)
tmp = summarise(group_by(test, Species),mean(Sepal.Length), sd(Sepal.Length))
进阶:
2.6 count()
count(test,Species) #结果是一个tibble,特殊数据框
table(iris$Species) #结果是一个table
2.7 管道操作 %>%
(ctrl+shift+M) :上一步的输出作为下一步的输入
library(dplyr)
x1 = filter(iris,Sepal.Width>3) #筛选行
x2 = select(x1,c("Sepal.Length","Sepal.Width" )) #筛选列
x3 = arrange(x2,Sepal.Length) #按照某一列进行排序
与下面的代码等价:
iris %>%
filter(Sepal.Width>3) %>%
select(c("Sepal.Length","Sepal.Width" ))%>%
arrange(Sepal.Length)
2.8 处理关系数据:即将2个表进行连接,注意:不要引入factor
原数据:
将test1和test2合并:
merge(test1,test2,by="name")
将test1和test3合并:
merge(test1,test3,by.x = "name",by.y = "NAME")
通过函数merge()
合并的数据只能取交集。
(1)內连inner_join,取交集(与merge
功能相同)
inner_join(test1, test2, by = "name")
inner_join(test1,test3,by = c("name"="NAME"))
(2)左连left_join
left_join(test1, test2, by = 'name') #按照test1合并数据
left_join(test2, test1, by = 'name') #按照test2合并数据
(3)全连full_join
full_join(test1, test2, by = 'name') #按照test1和test2并集合并
(4)半连接:返回能够与y表匹配的x表所有记录semi_join
semi_join(x = test1, y = test2, by = 'name')
(5)反连接:返回无法与y表匹配的x表的所记录anti_join
anti_join(x = test2, y = test1, by = 'name')
2.9 数据的简单合并
在相当于base包里的cbind()函数和rbind()函数;注意,bind_rows()函数需要两个表格列数相同,而bind_cols()函数则需要两个数据框有相同的行数
原数据:
bind_rows(test1, test2)
bind_cols(test1, test3)
3. stringr包
字符串:
3.1 检测字符串长度
> length(x)
[1] 1
> str_length(x) #字符个数
[1] 42
3.2 字符串拆分与组合
> str_split(x," ") #x可以是多个字符串组成的向量
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
> y <- sentences[1:3]
> str_split(y, " ")
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
[[2]]
[1] "Glue" "the" "sheet" "to" "the" "dark"
[7] "blue" "background."
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
> str_split(y, " ", simplify = T) #list简化之后是一个矩阵
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks." ""
[2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background." ""
[3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
> x2 = str_split(x," ")[[1]]
> str_c(x2,collapse = " ") #内部连接
[1] "The birch canoe slid on the smooth planks."
> str_c(x2,1234,sep = "+") #外部连接
[1] "The+1234" "birch+1234" "canoe+1234" "slid+1234" "on+1234"
[6] "the+1234" "smooth+1234" "planks.+1234"
3.3 提取字符串的一部分
> str_sub(x,5,9)
[1] "birch"
3.4 大小写转换
> str_to_upper(x2)
[1] "THE" "BIRCH" "CANOE" "SLID" "ON" "THE" "SMOOTH" "PLANKS."
> str_to_lower(x2)
[1] "the" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
> str_to_title(x2)
[1] "The" "Birch" "Canoe" "Slid" "On" "The" "Smooth" "Planks."
3.5 字符串排序
> str_sort(x2)
[1] "birch" "canoe" "on" "planks." "slid" "smooth" "the" "The"
3.6 字符检测
> str_detect(x2,"h") #********* 检验每个字符中是否含有h这个字母,含有返回TRUE,否则FALSE
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
> str_starts(x2,"T")
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> str_ends(x2,"e")
[1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
与sum和mean连用,可以统计匹配的个数和比例
sum(str_detect(x2,"h"))
mean(str_detect(x2,"h"))
3.7 提取匹配到的字符串
> str_subset(x2,"h")
[1] "The" "birch" "the" "smooth"
> x2[str_detect(x2,"h")] #等价
[1] "The" "birch" "the" "smooth"
3.8 字符计数
> str_count(x," ")
[1] 7
> x2
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
> str_count(x2,"o")
[1] 0 0 1 0 1 0 2 0
3.9 字符串替换
> x2
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
> ###9.字符串替换
> str_replace(x2,"o","A") #只替换第一个o
[1] "The" "birch" "canAe" "slid" "An" "the" "smAoth" "planks."
> str_replace_all(x2,"o","A") #全部替换
[1] "The" "birch" "canAe" "slid" "An" "the" "smAAth" "planks."
4.条件语句和循环语句
4.1 if语句
4.1.1 if(){}
(1) 只有if没有else,那么条件是FALSE时就什么都不做
(2) 有else
> i =1
> if (i>0){
+ cat('+') #cat()与print()的区别,cat()打印出本来的样子,print()打印出向量的第一个元素
+ } else {
+ print("-")
+ }
+
上面语句与下面等价:
> ifelse(i>0,"+","-")
[1] "+"
(3)多个条件
> i = 0
> if (i>0){
+ print('+')
+ } else if (i==0) {
+ print('0')
+ } else if (i< 0){
+ print('-')
+ }
[1] "0"
与下面语句等价:
> ifelse(i>0,
+ "+",
+ ifelse((i<0),
+ "-",
+ "0"))
[1] "0"
case_when()
dplyr
中函数,可以多个嵌套使用
4.1.2 switch()
> require(stats)
> centre <- function(x, type) {
+ switch(type,
+ mean = mean(x),
+ median = median(x),
+ trimmed = mean(x, trim = .1))
+ }
> x <- rcauchy(10)
> centre(x, "mean")
[1] -1.441392
> centre(x, "median")
[1] -0.7271923
> centre(x, "trimmed")
[1] -1.310942
为什么匹配的是cc这个值?
> cd = 3
> foo <- switch(EXPR = cd,
+ #EXPR = "aa",
+ aa=c(3.4,1),
+ bb=matrix(1:4,2,2),
+ cc=matrix(c(T,T,F,T,F,F),3,2),
+ dd="string here",
+ ee=matrix(c("red","green","blue","yellow")))
> foo
[,1] [,2]
[1,] TRUE TRUE
[2,] TRUE FALSE
[3,] FALSE FALSE
4.2 循环语句
4.2.1 for
循环
顺便看一下next
和break
do.call(cbind,result)
#按列组合列表中的每一个元素
4.2.2 while
循环
4.2.3 repeat
语句
注意必须要有break
next
结束本轮循环
break
结束循环
5. apply()
族函数
5.1 apply
处理矩阵或数据框
apply(X, MARGIN, FUN, …)
其中X是数据框/矩阵名;
MARGIN为1表示取行,为2表示取列,FUN是函数
5.2 lapply(list, FUN,...)
对列表/向量中的每个元素(向量)实施相同的操作, 返回值是一个列表
5.3 sapply
处理列表,简化结果,直接返回矩阵和向量
sapply(X, FUN, …) 注意和lapply的区别,返回值不一样