如何获取所有基因的转录起始位点--转载
http://blog.sciencenet.cn/blog-2985160-948631.html
我们在做人类全基因组分析的时候,经常需要找出基因组中所有基因的转录起始位点(Transcription Start Site, TSS),利用R/Bioconductor很容易做到。
用到一个包 Homo.sapiens,其中包含了目前已知的所有基因的注释信息,当然还有其他的包也含有所有基因的注释信息。
下面是获取的过程:
# 加载包
> library(Homo.sapiens)
#获所有基因
> all_genes <- genes(Homo.sapiens)
#查看前几个基因信息,
> head(all_genes)
GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | GENEID
<Rle> <IRanges> <Rle> | <FactorList>
1 chr19 [ 58858172, 58874214] - | 1
10 chr8 [ 18248755, 18258723] + | 10
100 chr20 [ 43248163, 43280376] - | 100
1000 chr18 [ 25530930, 25757445] - | 1000
10000 chr1 [243651535, 244006886] - | 10000
100008586 chrX [ 49217763, 49233491] + | 100008586
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
#查看所有基因数,总共有23056个基因
> length(all_genes)
[1] 23056
#获得基因转录起始位点
> all_gene_TSS <- resize(all_genes,1)
> all_gene_TSS
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | GENEID
<Rle> <IRanges> <Rle> | <FactorList>
1 chr19 [ 58874214, 58874214] - | 1
10 chr8 [ 18248755, 18248755] + | 10
100 chr20 [ 43280376, 43280376] - | 100
1000 chr18 [ 25757445, 25757445] - | 1000
10000 chr1 [244006886, 244006886] - | 10000
... ... ... ... ... ...
9991 chr9 [115095944, 115095944] - | 9991
9992 chr21 [ 35736323, 35736323] + | 9992
9993 chr22 [ 19109967, 19109967] - | 9993
9994 chr6 [ 90539619, 90539619] + | 9994
9997 chr22 [ 50964905, 50964905] - | 9997
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
#查看前10个基因的TSS
> head(start(all_gene_TSS), n=10)
[1] 58874214 18248755 43280376 25757445 244006886 49217763 101395274 71067384 72102894
[10] 89867818
# 获得TSS上下游 各100 bp的位置
> TSS_100 <-promoters(all_gene_TSS, 100, 100)
> TSS_100
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | GENEID
<Rle> <IRanges> <Rle> | <FactorList>
1 chr19 [ 58874115, 58874314] - | 1
10 chr8 [ 18248655, 18248854] + | 10
100 chr20 [ 43280277, 43280476] - | 100
1000 chr18 [ 25757346, 25757545] - | 1000
10000 chr1 [244006787, 244006986] - | 10000
... ... ... ... ... ...
9991 chr9 [115095845, 115096044] - | 9991
9992 chr21 [ 35736223, 35736422] + | 9992
9993 chr22 [ 19109868, 19110067] - | 9993
9994 chr6 [ 90539519, 90539718] + | 9994
9997 chr22 [ 50964806, 50965005] - | 9997
#获得TSS上游2000bp和下游500bp的序列,我们通常认为是基因的启动子部分,这里我们还需要加载一个包就是BSgenome.Hsapiens.UCSC.hg19,其中包含了人类整个基因组的序列。
> library(BSgenome.Hsapiens.UCSC.hg19)
> promoter <-promoters(all_gene_TSS, 2000, 500)
> promoter
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | GENEID
<Rle> <IRanges> <Rle> | <FactorList>
1 chr19 [ 58873715, 58876214] - | 1
10 chr8 [ 18246755, 18249254] + | 10
100 chr20 [ 43279877, 43282376] - | 100
1000 chr18 [ 25756946, 25759445] - | 1000
10000 chr1 [244006387, 244008886] - | 10000
... ... ... ... ... ...
9991 chr9 [115095445, 115097944] - | 9991
9992 chr21 [ 35734323, 35736822] + | 9992
9993 chr22 [ 19109468, 19111967] - | 9993
9994 chr6 [ 90537619, 90540118] + | 9994
9997 chr22 [ 50964406, 50966905] - | 9997
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> seq <- BSgenome.Hsapiens.UCSC.hg19
> promoter_seq <- getSeq(seq, promoter)
> promoter_seq
A DNAStringSet instance of length 23056
width seq names
[1] 2500 CACACACGGCTAATTTTTGTATTTTTAGT...CCCTGCCGCGCCATCATTTCTTCCCACA 1
[2] 2500 CTCTCCCACACTCAGTCAAAAATGGTCCA...CACATATTGAAATGGTCTTGCAAAACCA 10
[3] 2500 CTCACAGCAGGGAGCCCAGGCTTCTCAAA...GAGGTCTCTGAAGCTCAGCTGTATGATC 100
[4] 2500 TCTAGCAAAAAAAGAAGAGAAAGGGTAAG...CGGGAGCGCTGCGGACCCTGCTGCCGCT 1000
[5] 2500 CAGGTTCTCACTGTAGCACCCAGGCACGG...CAAACCAAAAATAATACGGTTGGTAAGA 10000
... ... ...
[23052] 2500 AGGTGTGAGCCACCATGCCCAGCTATAAT...CCCGGTCCGGCCGCGGTGCCGAGGTCCG 9991
[23053] 2500 CAGGTGGCACCGTCTCCTAGCGGAATTCT...GCGGTGGCTCACGCCTGTAATCCCAGCA 9992
[23054] 2500 CCTTCTTCCCTAACGCTGACTGCCCACTG...CACTCTCTGCGGACGCCTGCTGGAGCTT 9993
[23055] 2500 TACATAACTTAGGTGGAGTGGCTCATACC...TAGGCATCAAACCAACATGCCTAAATAA 9994
[23056] 2500 TGATGATGGAGACCCTGGCCAGAATCACT...CGTGGTGAGCGCCGCCCCCGCCCTGCTG 9997
----结束
#Session Infromation
R version 3.2.0 (2015-04-16)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 3
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] Homo.sapiens_1.1.2 TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
[3] org.Hs.eg.db_3.1.2 GO.db_3.1.2
[5] RSQLite_1.0.0 DBI_0.3.1
[7] OrganismDbi_1.10.0 GenomicFeatures_1.20.1
[9] AnnotationDbi_1.30.1 Biobase_2.28.0
[11] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.36.0
[13] rtracklayer_1.28.4 Biostrings_2.36.1
[15] XVector_0.8.0 GenomicRanges_1.20.4
[17] GenomeInfoDb_1.4.0 IRanges_2.2.2
[19] S4Vectors_0.6.0 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] graph_1.46.0 zlibbioc_1.14.0 GenomicAlignments_1.4.1
[4] BiocParallel_1.2.2 tools_3.2.0 lambda.r_1.1.7
[7] futile.logger_1.4.1 RBGL_1.44.0 futile.options_1.0.0
[10] bitops_1.0-6 biomaRt_2.24.0 RCurl_1.95-4.6
[13] Rsamtools_1.20.4 XML_3.98-1.2