biostring 序列的基本操作

还是先获取随机DNA序列和其他序列对象:

library(Biostrings)
rndSeq <- function(dict, n) {
    paste(sample(dict, n, replace = T), collapse = "")
}
set.seed(0)
# 用mapply和rndSeq函数获取5条序列(字符串):
DNA.raw <- mapply(rndSeq, list(DNA_BASES), rep(20, 5))
names(DNA.raw) <- paste("SEQ", 1:5, sep = "-")
# DNAString对象,1条序列
DNA.str <- DNAString(DNA.raw[1])
# DNAStringSet对象,含5条序列
DNA.set <- DNAStringSet(DNA.raw)
# Views对象
DNA.vws <- successiveViews(DNA.str, width = rep(4, 5))

一、获取序列基本信息

包括获取名称(names)、长度(length)、字符个数(nchar)和对象头/尾(head/tail)等信息的函数。
函数的用法简单,但需注意XString类对象的返回结果和其他类型有些差别:

# length函数----------------------
length(DNA.raw)
## [1] 5
# 结果为序列的长度
length(DNA.str)
## [1] 20
length(DNA.set)
## [1] 5
length(DNA.vws)
## [1] 5
# nchar函数-----------------------
nchar(DNA.raw)
## SEQ-1 SEQ-2 SEQ-3 SEQ-4 SEQ-5 
##    20    20    20    20    20
nchar(DNA.str)
## [1] 20
nchar(DNA.set)
## [1] 20 20 20 20 20
nchar(DNA.vws)
## [1] 4 4 4 4 4
# head/tail函数-------------------
head(DNA.raw, 2)
##                  SEQ-1                  SEQ-2 
## "TCCGTATTGGAAAGCTCGTC" "TTAGACCACTCCGCATGTAG"
# 结果为序列的前几个碱基
head(DNA.str, 2)
##   2-letter "DNAString" instance
## seq: TC
head(DNA.set, 2)
##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]    20 TCCGTATTGGAAAGCTCGTC                         SEQ-1
## [2]    20 TTAGACCACTCCGCATGTAG                         SEQ-2
head(DNA.vws, 2)
##   Views on a 20-letter DNAString subject
## subject: TCCGTATTGGAAAGCTCGTC
## views:
##     start end width
## [1]     1   4     4 [TCCG]
## [2]     5   8     4 [TATT]

二、序列转换

1、获取反向、互补、反向互补序列:

reverse(), complement(), reverseComplement()可以使用string, XString, XXXSet和Views类对象进行操作。下面看看reverse函数的结果:

DNA.raw
##                  SEQ-1                  SEQ-2                  SEQ-3 
## "TCCGTATTGGAAAGCTCGTC" "TTAGACCACTCCGCATGTAG" "CTGTGGTACGGCTCAAACGG" 
##                  SEQ-4                  SEQ-5 
## "CTCCCGCCTATCTCCCTTCT" "TCGCCTAGAAAAAGTTTCCT"
reverse(DNA.raw)
##                  SEQ-1                  SEQ-2                  SEQ-3 
## "CTGCTCGAAAGGTTATGCCT" "GATGTACGCCTCACCAGATT" "GGCAAACTCGGCATGGTGTC" 
##                  SEQ-4                  SEQ-5 
## "TCTTCCCTCTATCCGCCCTC" "TCCTTTGAAAAAGATCCGCT"
reverse(DNA.str)
##   20-letter "DNAString" instance
## seq: CTGCTCGAAAGGTTATGCCT
reverse(DNA.set)
##   A DNAStringSet instance of length 5
##     width seq                                          names               
## [1]    20 CTGCTCGAAAGGTTATGCCT                         SEQ-1
## [2]    20 GATGTACGCCTCACCAGATT                         SEQ-2
## [3]    20 GGCAAACTCGGCATGGTGTC                         SEQ-3
## [4]    20 TCTTCCCTCTATCCGCCCTC                         SEQ-4
## [5]    20 TCCTTTGAAAAAGATCCGCT                         SEQ-5
reverse(DNA.vws)
##   Views on a 20-letter DNAString subject
## subject: CTGCTCGAAAGGTTATGCCT
## views:
##     start end width
## [1]    17  20     4 [GCCT]
## [2]    13  16     4 [TTAT]
## [3]     9  12     4 [AAGG]
## [4]     5   8     4 [TCGA]
## [5]     1   4     4 [CTGC]

2、序列翻译:

翻译函数translate()只能使用XString和XXXSet类对象(Biostrings version 2.29.3):

# 错误
translate(DNA.raw)
## Error: unable to find an inherited method for function 'translate' for
## signature '"character"'
translate(DNA.str)
##   6-letter "AAString" instance
## seq: SVLESS
# 错误
translate(DNA.set)
##   A AAStringSet instance of length 5
##     width seq
## [1]     6 SVLESS
## [2]     6 LDHSAC
## [3]     6 LWYGSN
## [4]     6 LPPISL
## [5]     6 SPRKSF
# 错误
translate(DNA.vws)
## Error: unable to find an inherited method for function 'translate' for
## signature '"XStringViews"'

3、DNA/RNA互转:

使用dna2rna, rna2dna或cDNA函数。这些函数对于数据对象有严格的要求:

DNA.str
##   20-letter "DNAString" instance
## seq: TCCGTATTGGAAAGCTCGTC
dna2rna(DNA.str)
##   20-letter "RNAString" instance
## seq: UCCGUAUUGGAAAGCUCGUC
# 错误
dna2rna(DNA.raw)
## Error: dna2rna() only works on DNA input
# 错误
cDNA(DNA.str)
## Error: cDNA() only works on RNA input
cDNA(dna2rna(DNA.str))
##   20-letter "DNAString" instance
## seq: AGGCATAACCTTTCGAGCAG

4、其他:

R base包的chartr函数已经重载,可以应用于序列对象,但最好避免使用,因为会出现符号检查问题:

chartr("T", "A", DNA.set)
##   A DNAStringSet instance of length 5
##     width seq                                          names               
## [1]    20 ACCGAAAAGGAAAGCACGAC                         SEQ-1
## [2]    20 AAAGACCACACCGCAAGAAG                         SEQ-2
## [3]    20 CAGAGGAACGGCACAAACGG                         SEQ-3
## [4]    20 CACCCGCCAAACACCCAACA                         SEQ-4
## [5]    20 ACGCCAAGAAAAAGAAACCA                         SEQ-5
# 错误
chartr("T", "U", DNA.set)
## Error: key 85 (char 'U') not in lookup table

其他一些R base包的字符串操作函数也可以使用序列对象,但注意返回结果的类型。下面两行代码的返回值都是字符串向量,而不是DNAStringSet对象:

tolower(DNA.set)
##                  SEQ-1                  SEQ-2                  SEQ-3 
## "tccgtattggaaagctcgtc" "ttagaccactccgcatgtag" "ctgtggtacggctcaaacgg" 
##                  SEQ-4                  SEQ-5 
## "ctcccgcctatctcccttct" "tcgcctagaaaaagtttcct"
gsub("T", "U", DNA.set)
##                  SEQ-1                  SEQ-2                  SEQ-3 
## "UCCGUAUUGGAAAGCUCGUC" "UUAGACCACUCCGCAUGUAG" "CUGUGGUACGGCUCAAACGG" 
##                  SEQ-4                  SEQ-5 
## "CUCCCGCCUAUCUCCCUUCU" "UCGCCUAGAAAAAGUUUCCU"

三、序列截取(sequence subset)

1、使用序列构造函数

Biostrings的XXXSet类和Views类序列构造函数本身就具备序列subset的功能,Views类对象可以通过类型转换获得XXXSet类对象。具体使用方法请参看上一篇文章

2、subseq函数

subseq函数可以用于序列截取,也可以对选定序列进行修改:

DNAstr <- DNAString(rndSeq(DNA_BASES, 50))
(subseq(DNAstr, start = 20, end = 30))
##   11-letter "DNAString" instance
## seq: CGTCCATCGAA
# 上面语句的subseq函数仅仅是截取了序列,没有改变原序列
DNAstr
##   50-letter "DNAString" instance
## seq: GGCCTGAACTGTGCCAAGACGTCCATCGAAGGAAGTGGGTGGGACGCAGA
# 下面语句的subseq函数改变了原序列
subseq(DNAstr, start = 20, end = 30) <- DNAString("NNN")
DNAstr
##   42-letter "DNAString" instance
## seq: GGCCTGAACTGTGCCAAGANNNGGAAGTGGGTGGGACGCAGA

3、特殊序列查找和截取

回文(palindrome)序列相关的函数,有:

findPalindromes(subject, min.armlength = 4, max.looplength = 1, min.looplength = 0, 
    max.mismatch = 0)
palindromeArmLength(x, max.mismatch = 0, ...)
palindromeLeftArm(x, max.mismatch = 0, ...)
palindromeRightArm(x, max.mismatch = 0, ...)
findComplementedPalindromes(subject, min.armlength = 4, max.looplength = 1, 
    min.looplength = 0, max.mismatch = 0)
complementedPalindromeArmLength(x, max.mismatch = 0, ...)
complementedPalindromeLeftArm(x, max.mismatch = 0, ...)
complementedPalindromeRightArm(x, max.mismatch = 0, ...)

有应用需求的自己去看看函数说明吧。

四、字符或寡核苷酸组合的统计

1、letterFrequency函数

这个函数用于统计指定字符的频率或比例:

letterFrequency(DNA.set, DNA_BASES)
##      A  C G T
## [1,] 4  5 5 6
## [2,] 5  6 4 5
## [3,] 4  5 7 4
## [4,] 1 11 1 7
## [5,] 6  5 3 6
letterFrequency(DNA.set, DNA_ALPHABET)
##      A  C G T M R W S Y K V H D B N - +
## [1,] 4  5 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 5  6 4 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 4  5 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 1 11 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 6  5 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0
letterFrequency(DNA.set, DNA_BASES, as.prob = TRUE)
##         A    C    G    T
## [1,] 0.20 0.25 0.25 0.30
## [2,] 0.25 0.30 0.20 0.25
## [3,] 0.20 0.25 0.35 0.20
## [4,] 0.05 0.55 0.05 0.35
## [5,] 0.30 0.25 0.15 0.30
# 注意下面这种用法:
letterFrequency(DNA.set, "GC", as.prob = TRUE)
##      G|C
## [1,] 0.5
## [2,] 0.5
## [3,] 0.6
## [4,] 0.6
## [5,] 0.4

2、letterFrequencyInSlidingView函数

该函数按设置的窗口长度(view.width)一个个碱基滑动并统计字符频率或比例:

letterFrequencyInSlidingView(DNA.set[[1]], view.width = 16, letter = DNA_BASES)
##      A C G T
## [1,] 4 3 4 5
## [2,] 4 4 4 4
## [3,] 4 3 5 4
## [4,] 4 2 5 5
## [5,] 4 3 4 5
letterFrequencyInSlidingView(DNA.set[[1]], 16, "GC", as.prob = TRUE)
##         G|C
## [1,] 0.4375
## [2,] 0.5000
## [3,] 0.5000
## [4,] 0.4375
## [5,] 0.4375

3、alphabetFrequency函数

作用与letterFrequency函数类似,但按ALPHABET中的所有因子进行统计。baseOnly设置为TRUE可以对ALPHABET进行限制:

alphabetFrequency(DNA.set)
##      A  C G T M R W S Y K V H D B N - +
## [1,] 4  5 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 5  6 4 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 4  5 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 1 11 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 6  5 3 6 0 0 0 0 0 0 0 0 0 0 0 0 0
alphabetFrequency(DNA.set, as.prob = TRUE, baseOnly = TRUE)
##         A    C    G    T other
## [1,] 0.20 0.25 0.25 0.30     0
## [2,] 0.25 0.30 0.20 0.25     0
## [3,] 0.20 0.25 0.35 0.20     0
## [4,] 0.05 0.55 0.05 0.35     0
## [5,] 0.30 0.25 0.15 0.30     0

4、寡核苷酸统计:

有以下函数,分别统计2核苷酸组合、3核苷酸组合和寡核苷酸组合:

dinucleotideFrequency
trinucleotideFrequency
oligonucleotideFrequency

五、杂项函数

包括使用序列对象进行运算的一些函数(和符号):

# 反转对象元素的顺序,不是反转序列!
rev(DNA.set)
##   A DNAStringSet instance of length 5
##     width seq                                          names               
## [1]    20 TCGCCTAGAAAAAGTTTCCT                         SEQ-5
## [2]    20 CTCCCGCCTATCTCCCTTCT                         SEQ-4
## [3]    20 CTGTGGTACGGCTCAAACGG                         SEQ-3
## [4]    20 TTAGACCACTCCGCATGTAG                         SEQ-2
## [5]    20 TCCGTATTGGAAAGCTCGTC                         SEQ-1
# 判断两个对象是否相同
DNA.set[[1]] == DNA.set[[2]]
## [1] FALSE
# 判断对象是否包含在另外一个对象的元素中
DNA.str %in% DNA.set
## [1] TRUE

Biostrings在2.0版后还添加了append函数,可以把几个序列集合合并,还可以使用c函数进行合并:

append(DNA.set, DNAStringSet(DNA.str))
##   A DNAStringSet instance of length 6
##     width seq                                          names               
## [1]    20 TCCGTATTGGAAAGCTCGTC                         SEQ-1
## [2]    20 TTAGACCACTCCGCATGTAG                         SEQ-2
## [3]    20 CTGTGGTACGGCTCAAACGG                         SEQ-3
## [4]    20 CTCCCGCCTATCTCCCTTCT                         SEQ-4
## [5]    20 TCGCCTAGAAAAAGTTTCCT                         SEQ-5
## [6]    20 TCCGTATTGGAAAGCTCGTC
c(DNA.set, DNAStringSet(DNA.str))
##   A DNAStringSet instance of length 6
##     width seq                                          names               
## [1]    20 TCCGTATTGGAAAGCTCGTC                         SEQ-1
## [2]    20 TTAGACCACTCCGCATGTAG                         SEQ-2
## [3]    20 CTGTGGTACGGCTCAAACGG                         SEQ-3
## [4]    20 CTCCCGCCTATCTCCCTTCT                         SEQ-4
## [5]    20 TCGCCTAGAAAAAGTTTCCT                         SEQ-5
## [6]    20 TCCGTATTGGAAAGCTCGTC

除以上列出的例子外还有duplicated, unique, sort, order, split, relist等。随着Biostrings的完善,可能会添加更多的函数,使序列的运算更符合R语言的运算习惯。

参考资料

金子哦的博客:http://blog.csdn.net/u014801157/article/details/24372455

posted @ 2017-08-19 17:13  ywliao  阅读(2613)  评论(0编辑  收藏  举报