Motif enrichment analysis for alternative splicing regulators | motif enrichment | FIMO | MEME
Related post: Transcription factor motif enrichment analysis in TSS regions | motif enrichment | HOMER | FIMO | MEME
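This is a new area for me. I am now looking at alternative splicing factors (ASFs) such as PTBP1, which have specific RNA-binding motifs, much like transcription factors (TFs) have DNA-binding motifs.
Similarities:
- Both kinds of motif are sequence regions bound by a protein
- Both have a specific sequence motif
Differences:
- TF motifs lie mainly in promoters and enhancers and regulate gene transcription
- ASF motifs lie mainly in the intronic regions of genes and regulate alternative splicing
PTBP1 is used as the example here.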
Paper that inspired this: 2018 - Cancer Cell - PTBP1-Mediated Alternative Splicing Regulates the Inflammatory Secretome and the Pro-tumorigenic Effects of Senescent Cells
RNA-Binding Motif Analysis
FIMO (Grant et al., 2011) was used to scan the human gene sequences for the PTBP1 RNA-binding motifs inferred by (Ray et al., 2013). The thereby predicted occurrences were mapped to the analyzed splicing events. To generate the RNA-maps (Figures 7B and S7D), for each comparison alternative exons were divided into those with PSIs significantly increasing upon PTBP1 knockdown (putatively repressed), those with PSIs significantly decreasing upon PTBP1 knockdown (putatively enhanced), and those with PSIs not altered upon PTBP1 knockdown (putatively not regulated). Statistical significance for local motif enrichment is associated with Fisher’s exact tests for differences in motif occurrences between groups of exons within 31 bp moving windows.
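The moving-window test described in the quoted methods can be sketched roughly as below. This is only an illustration under assumptions: the paper does not give its exact implementation, so the input layout (one set of motif hit positions per exon, relative to some anchor such as a splice site), the function names, and the choice of a two-sided test are all hypothetical. Fisher's exact test is written out with the standard library so that the sketch needs no extra packages.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def window_enrichment(hits_a, hits_b, span, window=31):
    """Compare two exon groups window by window.

    hits_a, hits_b: one set of motif positions per exon (assumed layout).
    Returns a list of (window_start, p_value) over positions [0, span).
    """
    results = []
    for start in range(0, span - window + 1):
        end = start + window
        # Number of exons in each group with at least one hit in this window.
        a = sum(any(start <= p < end for p in exon) for exon in hits_a)
        c = sum(any(start <= p < end for p in exon) for exon in hits_b)
        p = fisher_exact_two_sided(a, len(hits_a) - a, c, len(hits_b) - c)
        results.append((start, p))
    return results
```

Counting exons with at least one hit per window, rather than total hits, is one possible reading of "differences in motif occurrences between groups of exons"; the paper may count differently.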
Finding the RNA motif
Look up Ray et al., 2013, "A compendium of RNA-binding motifs for decoding gene regulation".
Following that lead, I found a database: CISBP-RNA Database: Catalog of Inferred Sequence Binding Preferences of RNA binding proteins
Procedure: export the hg38 gene sequences (including exons and introns)
http://www.genome.ucsc.edu/cgi-bin/hgTables
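Predict motif occurrences with FIMO: https://meme-suite.org/meme/tools/fimo
To get the short motif sequence in MEME format, use the web version, which generates the file for you; just download it.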
MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies (from unknown source):
A 0.250 C 0.250 G 0.250 T 0.250

MOTIF 1 HYTTTYT

letter-probability matrix: alength= 4 w= 7 nsites= 1 E= 0e+0
0.333333 0.333333 0.000000 0.333333
0.000000 0.500000 0.000000 0.500000
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 0.000000 1.000000
0.000000 0.500000 0.000000 0.500000
0.000000 0.000000 0.000000 1.000000
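For reference, a file like the one above can also be produced locally from the IUPAC consensus string. The sketch below is not how the MEME web tool does it; it only assumes that each ambiguity code is spread uniformly over the bases it allows, which reproduces the matrix shown for HYTTTYT. The function names are made up for illustration.

```python
# IUPAC nucleotide codes mapped to the DNA bases they allow (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_rows(consensus):
    """Return one [A, C, G, T] probability row per consensus position."""
    rows = []
    for code in consensus.upper():
        allowed = IUPAC[code]
        rows.append([1.0 / len(allowed) if base in allowed else 0.0
                     for base in "ACGT"])
    return rows

def to_meme(consensus, name="1"):
    """Format a consensus string as minimal MEME-format text with a uniform background."""
    rows = consensus_to_rows(consensus)
    lines = [
        "MEME version 4", "",
        "ALPHABET= ACGT", "",
        "strands: + -", "",
        "Background letter frequencies (from unknown source):",
        "A 0.250 C 0.250 G 0.250 T 0.250", "",
        f"MOTIF {name} {consensus}", "",
        f"letter-probability matrix: alength= 4 w= {len(rows)} nsites= 1 E= 0e+0",
    ]
    lines += [" ".join(f"{v:.6f}" for v in row) for row in rows]
    return "\n".join(lines) + "\n"

print(to_meme("HYTTTYT"))
```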
fimo --alpha 1 --max-strand -oc target PTBP1.motif.meme hg38_gene.fasta
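A small DNA / RNA / protein conversion tool: http://biomodel.uah.es/en/lab/cybertory/analysis/trans.htm
Notes:
The motif and the sequences must use the same alphabet: T for DNA, U for RNA. Otherwise nothing will match.
If the motif is an RNA motif, make a reverse-complemented DNA version of it: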
MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies (from unknown source):
A 0.250 C 0.250 G 0.250 T 0.250

MOTIF 1 ARAAARD

letter-probability matrix: alength= 4 w= 7 nsites= 1 E= 0e+0
1.000000 0.000000 0.000000 0.000000
0.500000 0.000000 0.500000 0.000000
1.000000 0.000000 0.000000 0.000000
1.000000 0.000000 0.000000 0.000000
1.000000 0.000000 0.000000 0.000000
0.500000 0.000000 0.500000 0.000000
0.333333 0.000000 0.333333 0.333333
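The reverse-complement step can be checked with a few lines of code instead of the web converter. This is a minimal sketch with made-up function names; it assumes matrix columns are in A, C, G, T order, as in the MEME files in this post. Reversing the row order reverses the motif, and reversing each row swaps A with T and C with G.

```python
# Complement table for IUPAC codes (U is treated like T).
COMPLEMENT = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")

def reverse_complement_consensus(consensus):
    """Reverse-complement an IUPAC consensus string, returning DNA letters."""
    return consensus.upper().translate(COMPLEMENT)[::-1]

def reverse_complement_matrix(rows):
    """Reverse-complement a letter-probability matrix.

    rows: list of [A, C, G, T] probabilities, one list per motif position.
    """
    return [list(reversed(row)) for row in reversed(rows)]

print(reverse_complement_consensus("HYTTTYT"))  # ARAAARD
```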
fimo --alpha 1 --max-strand -oc target PTBP1.DNA.motif.meme hg38_gene.fasta --max-stored-scores 1000000 --thresh 1e-4
Next time, test on a small dataset first; otherwise a whole night of running is wasted.
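One simple way to get such a small test set is to keep only the first few records of the FASTA file. The helper below is a sketch with a hypothetical name; it works on any iterable of lines, so it can be fed an open file handle.

```python
def first_records(lines, n_records=100):
    """Yield the lines belonging to the first n_records FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            if seen == n_records:
                return  # stop at the header of record n_records + 1
            seen += 1
        if seen:  # ignore anything before the first header
            yield line

# Tiny in-memory demonstration with made-up sequences.
demo = [">geneA\n", "ACGT\n", ">geneB\n", "TTTT\n", ">geneC\n", "GGGG\n"]
print("".join(first_records(demo, n_records=2)), end="")
```

With real data, something like `open("hg38_gene.fasta")` would be passed in place of `demo` and the yielded lines written to a smaller file (the output file name is up to you) before running FIMO on it.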
--max-strand
If matches on both strands at a given position satisfy the output threshold, only report the match for the strand with the higher score. If the scores are tied, the matching strand is chosen at random.
Resource usage
--max-stored-scores 1000000 used 1.48 GB of memory and 1 CPU
--max-stored-scores 10000000: memory and CPU usage not recorded
Latest commands:
fimo --max-stored-scores 10000000 --thresh 1e-4 --alpha 1 -oc target2 --text --max-strand PTBP1.DNA.motif.meme hg38_gene.fasta > output.tsv
fimo --max-stored-scores 10000000 --thresh 1e-4 --alpha 1 -oc target2 --skip-matched-sequence --max-strand PTBP1.DNA.motif.meme hg38_gene.fasta > output2.tsv
--skip-matched-sequence [much faster output: run time dropped from an hour and a half to 10 minutes]
Like the --text option, this limits output to tab-separated values (TSV) sent to standard out, but in addition, turns off output of the sequence of motif matches. This speeds up processing considerably.
--text [results go to standard output]
Limits output to TSV (tab-separated values) formatted results sent to standard output. The results are unsorted and no q-values are output, allowing very large files to be searched.
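Once the TSV is written, the hits can be summarized per gene before mapping them to splicing events. The sketch below assumes the column names of recent FIMO versions (`sequence_name`, `p-value`, and so on, with a plain header line); older releases may use different headers, so check the first line of your own file. The function name and the example rows are made up.

```python
import csv
from collections import Counter

def count_hits(lines, max_p=1e-4):
    """Count motif hits per sequence from FIMO TSV lines.

    Assumes a header line with 'sequence_name' and 'p-value' columns.
    Blank lines and '#' comment lines are skipped.
    """
    data = (l for l in lines if l.strip() and not l.startswith("#"))
    counts = Counter()
    for row in csv.DictReader(data, delimiter="\t"):
        if float(row["p-value"]) <= max_p:
            counts[row["sequence_name"]] += 1
    return counts

# Made-up rows in the assumed layout; q-value and matched_sequence are empty,
# as they would be with --skip-matched-sequence.
example = [
    "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n",
    "1\tARAAARD\tgeneA\t10\t16\t+\t9.5\t5e-05\t\t\n",
    "1\tARAAARD\tgeneA\t40\t46\t-\t9.5\t5e-05\t\t\n",
    "1\tARAAARD\tgeneB\t3\t9\t+\t2.0\t0.01\t\t\n",
]
print(count_hits(example))
```

For a real run, pass an open handle on the output file, for example `count_hits(open("output2.tsv"))`.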
Reference:
~/project/scPipeline/motifEnrichment/ASF_motif/