R语言6进阶

生信技能树R语言部分学习笔记

1. tidyr包能做什么事情?

1.1 数据清理

tidyr的扁和长
数据的两种组织方式:扁和长,二者包含的信息相同,可以相互转换(用tidyr中的函数:gather-spread)
举例:
扁数据

(1) 扁变长

test_gather <- gather(data = test,
                    key = sample_nm,
                    value = exp,
                    - geneid)     ##表示选择除了geneid这一列的列
head(test_gather)

gather()函数已经被新的函数替代了:df %>% gather("key", "value", x, y, z) 等价于 df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")

pivot_longer(data = test,
             names_to = "sample_nm", 
             values_to = "value",
             -geneid)

(2)长变扁

test_re <- spread(data = test_gather,
                key = sample_nm,
                value = exp)
head(test_re)

同样的该函数spread()也被别的函数所替代:df %>% spread(key, value) 等价于 df %>% pivot_wider(names_from = key, values_from = value)

pivot_wider(data = test_gather,
            names_from = sample_nm,
            values_from = exp)

1.2 分割和合并

tidyr的分和合
采用的函数:separateunite

原始数据:

test <- data.frame(x = c( "a,b", "a,d", "b,c"));test

分割:

test_seprate <- separate(test,x, c("X", "Y"),sep = ",");test_seprate

合并:

test_re <- unite(test_seprate,"x",X,Y,sep = ",");test_re

1.3 处理NA

原始数据:

(1)去掉含有NA的行,可以选择只根据某一列来去除

drop_na(X)      #把所有带有NA的行都删掉,功能与na.omit()相同

drop_na(X,X1)    #按照第一列来去除NA的行

drop_na(X,X2)

(2)替换NA

> replace_na(X$X2,0)    #将第二列中的NA的值用0替换,只是打印出来没有实际赋值操作
[1] 1 0 3 4 5

(3)用上一行的值填充NA

fill(X,X2)

2. dplyr包的功能

基础:
dplyr包是专门用来处理数据框。

原始数据:

2.1 mutate() 新增列

mutate(test, new = Sepal.Length * Sepal.Width)

2.2 select() 按列取子集

(1)按列号筛选

select(test,1)     #选出原数据的第一列
select(test,c(1,5))  #选出原数据的第一、五列

(2)按列名筛选

select(test,Sepal.Length)
select(test, Petal.Length, Petal.Width)
vars <- c("Petal.Length", "Petal.Width")    #与上面的代码等价
select(test, one_of(vars))

(3)一组来自tidyselect的有用函数

select(test, starts_with("Petal"))     #选择以Petal为前缀的列
select(test, ends_with("Width"))       #选择以Width为后缀的列
select(test, contains("etal"))         #选择包含etal的列
select(test, matches(".t."))           #选择包含.t.的列
select(test, everything())             #选择所有的变量
select(test, last_col())               #选择最后一个变量
select(test, last_col(offset = 1))     #选择倒数第一个变量

(4)利用everything(),列名可以重排序

select(test,Species,everything())

2.3 filter() 按行取子集

filter(test, Species == "setosa")
filter(test, Species == "setosa"&Sepal.Length > 5 )
filter(test, Species %in% c("setosa","versicolor"))

2.4 arrange() 按照数据框的某一列来给整个数据框进行排序

arrange(test, Sepal.Length)#默认从小到大排序
arrange(test, desc(Sepal.Length))#用desc从大到小
arrange(test,  desc(Sepal.Width),Sepal.Length)

2.5 summarise() 汇总, 通常和分组一起使用

library(dplyr)
summarise(iris, k = mean(Sepal.Length))   #汇总总的数据
summarise(group_by(iris, Species),
          k = mean(Sepal.Length))     #分组汇总

用管道符号表示为:

iris %>% group_by(Species) %>% summarise(k = mean(Sepal.Length))

对数据进行汇总操作,结合group_by使用实用性强

summarise(test, mean(Sepal.Length), sd(Sepal.Length))# 计算Sepal.Length的平均值和标准差

先按照Species分组,计算每组Sepal.Length的平均值和标准差

group_by(test, Species)
tmp = summarise(group_by(test, Species),mean(Sepal.Length), sd(Sepal.Length))

进阶:

2.6 count()

count(test,Species)   #结果是一个tibble,特殊数据框
table(iris$Species)   #结果是一个table

2.7 管道操作 %>% (ctrl+shift+M) :上一步的输出作为下一步的输入

library(dplyr)
x1 = filter(iris,Sepal.Width>3)      #筛选行
x2 = select(x1,c("Sepal.Length","Sepal.Width" ))   #筛选列
x3 = arrange(x2,Sepal.Length)     #按照某一列进行排序

与下面的代码等价:

iris %>% 
  filter(Sepal.Width>3) %>% 
  select(c("Sepal.Length","Sepal.Width" ))%>%
  arrange(Sepal.Length)

2.8 处理关系数据:即将2个表进行连接,注意:不要引入factor

原数据:

将test1和test2合并:

merge(test1,test2,by="name")

将test1和test3合并:

merge(test1,test3,by.x = "name",by.y = "NAME")

通过函数merge()合并的数据只能取交集。

(1)內连inner_join,取交集(与merge功能相同)

inner_join(test1, test2, by = "name")
inner_join(test1,test3,by = c("name"="NAME"))

(2)左连left_join

left_join(test1, test2, by = 'name')   #按照test1合并数据

left_join(test2, test1, by = 'name')  #按照test2合并数据

(3)全连full_join

full_join(test1, test2, by = 'name')  #按照test1和test2并集合并

(4)半连接:返回能够与y表匹配的x表所有记录semi_join

semi_join(x = test1, y = test2, by = 'name')

(5)反连接:返回无法与y表匹配的x表的所记录anti_join

anti_join(x = test2, y = test1, by = 'name')

2.9 数据的简单合并

在相当于base包里的cbind()函数和rbind()函数;注意,bind_rows()函数需要两个表格列数相同,而bind_cols()函数则需要两个数据框有相同的行数

原数据:

bind_rows(test1, test2)

bind_cols(test1, test3)

3. stringr包

字符串:

3.1 检测字符串长度

> length(x)
[1] 1
> str_length(x)   #字符个数
[1] 42

3.2 字符串拆分与组合

> str_split(x," ")     #x可以是多个字符串组成的向量
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

> y <- sentences[1:3]
> str_split(y, " ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"         "dark"       
[7] "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

> str_split(y, " ", simplify = T)  #list简化之后是一个矩阵
     [,1]   [,2]    [,3]    [,4]   [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"  "birch" "canoe" "slid" "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue" "the"   "sheet" "to"   "the" "dark"  "blue"   "background." ""     
[3,] "It's" "easy"  "to"    "tell" "the" "depth" "of"     "a"           "well."
> x2 = str_split(x," ")[[1]]
> str_c(x2,collapse = " ")    #内部连接
[1] "The birch canoe slid on the smooth planks."
> str_c(x2,1234,sep = "+")    #外部连接
[1] "The+1234"     "birch+1234"   "canoe+1234"   "slid+1234"    "on+1234"     
[6] "the+1234"     "smooth+1234"  "planks.+1234"

3.3 提取字符串的一部分

> str_sub(x,5,9)
[1] "birch"

3.4 大小写转换

> str_to_upper(x2)
[1] "THE"     "BIRCH"   "CANOE"   "SLID"    "ON"      "THE"     "SMOOTH"  "PLANKS."
> str_to_lower(x2)
[1] "the"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."
> str_to_title(x2)
[1] "The"     "Birch"   "Canoe"   "Slid"    "On"      "The"     "Smooth"  "Planks."

3.5 字符串排序

> str_sort(x2)
[1] "birch"   "canoe"   "on"      "planks." "slid"    "smooth"  "the"     "The"  

3.6 字符检测

> str_detect(x2,"h")    #********* 检验每个字符中是否含有h这个字母,含有返回TRUE,否则FALSE
[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
> str_starts(x2,"T")
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> str_ends(x2,"e")
[1]  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

与sum和mean连用,可以统计匹配的个数和比例

sum(str_detect(x2,"h"))
mean(str_detect(x2,"h"))

3.7 提取匹配到的字符串

> str_subset(x2,"h")
[1] "The"    "birch"  "the"    "smooth"
> x2[str_detect(x2,"h")]    #等价
[1] "The"    "birch"  "the"    "smooth"

3.8 字符计数

> str_count(x," ")
[1] 7
> x2
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."
> str_count(x2,"o")
[1] 0 0 1 0 1 0 2 0

3.9 字符串替换

> x2
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."
> ###9.字符串替换
> str_replace(x2,"o","A")   #只替换第一个o
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"     "smAoth"  "planks."
> str_replace_all(x2,"o","A")  #全部替换
[1] "The"     "birch"   "canAe"   "slid"    "An"      "the"     "smAAth"  "planks."

4.条件语句和循环语句

4.1 if语句

4.1.1 if(){}

(1) 只有if没有else,那么条件是FALSE时就什么都不做
(2) 有else

> i =1
> if (i>0){
+   cat('+')   #cat()与print()的区别,cat()打印出本来的样子,print()打印出向量的第一个元素
+ } else {
+   print("-")
+ }
+

上面语句与下面等价:

> ifelse(i>0,"+","-")
[1] "+"

(3)多个条件

> i = 0
> if (i>0){
+   print('+')
+ } else if (i==0) {
+   print('0')
+ } else if (i< 0){
+   print('-')
+ }
[1] "0"

与下面语句等价:

> ifelse(i>0,
+        "+",
+        ifelse((i<0),
+               "-",
+               "0"))
[1] "0"

case_when() dplyr中函数,可以多个嵌套使用

4.1.2 switch()

> require(stats)
> centre <- function(x, type) {
+     switch(type,
+            mean = mean(x),
+            median = median(x),
+            trimmed = mean(x, trim = .1))
+ }
> x <- rcauchy(10)
> centre(x, "mean")
[1] -1.441392
> centre(x, "median")
[1] -0.7271923
> centre(x, "trimmed")
[1] -1.310942

为什么匹配的是cc这个值?

> cd = 3
> foo <- switch(EXPR = cd, 
+               #EXPR = "aa", 
+               aa=c(3.4,1),
+               bb=matrix(1:4,2,2),
+               cc=matrix(c(T,T,F,T,F,F),3,2),
+               dd="string here",
+               ee=matrix(c("red","green","blue","yellow")))
> foo
      [,1]  [,2]
[1,]  TRUE  TRUE
[2,]  TRUE FALSE
[3,] FALSE FALSE

4.2 循环语句

4.2.1 for循环

顺便看一下nextbreak
do.call(cbind,result) #按列组合列表中的每一个元素

4.2.2 while循环

4.2.3 repeat语句

注意必须要有break
next 结束本轮循环
break结束循环

5. apply()族函数

5.1 apply处理矩阵或数据框

apply(X, MARGIN, FUN, …)
其中X是数据框/矩阵名
MARGIN为1表示取行,为2表示取列,FUN是函数

5.2 lapply(list, FUN,...)

列表/向量中的每个元素(向量)实施相同的操作, 返回值是一个列表

5.3 sapply处理列表,简化结果,直接返回矩阵和向量

sapply(X, FUN, …) 注意和lapply的区别,返回值不一样

posted @ 2021-03-01 10:07  stdforml  阅读(202)  评论(0编辑  收藏  举报