可变剪切 | isoform | 提取特定exon的usage | DEXSeq
2023年06月19日
注意基因组的版本,基因的坐标是完全不一样的,区别human和mouse。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | # cd ~/tmpData/ApcKO_cellranger/cellranger_report/ # human hg38 chr6:15246069-15522042 # mouse GRCm39 chr13:44882950-45075119 # mouse mm10 chr13:44731271-44921643 samtools view -h WT2_report /outs/gex_possorted_bam .bam "chr13:44731271-44921643" > WT2_Jarid2.sam samtools view -h D4_report /outs/gex_possorted_bam .bam "chr13:44731271-44921643" > D4_Jarid2.sam samtools view -h D8_report /outs/gex_possorted_bam .bam "chr13:44731271-44921643" > D8_Jarid2.sam samtools view -h D21_report /outs/gex_possorted_bam .bam "chr13:44731271-44921643" > D21_Jarid2.sam python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff WT2_Jarid2.sam WT2_Jarid2.count.txt python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff D4_Jarid2.sam D4_Jarid2.count.txt python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff D8_Jarid2.sam D8_Jarid2.count.txt python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff D21_Jarid2.sam D21_Jarid2.count.txt sed 's/\"//g' WT2_Jarid2.count.txt > WT2_Jarid2.count.clean.txt sed 's/\"//g' D4_Jarid2.count.txt > D4_Jarid2.count.clean.txt sed 's/\"//g' D8_Jarid2.count.txt > D8_Jarid2.count.clean.txt sed 's/\"//g' D21_Jarid2.count.txt > D21_Jarid2.count.clean.txt |
2023年03月16日
RNA Splicing双管齐下:
SUPPA2是以AS event为基础的分析,根据transcript的TPM来计算的;
DEXSeq则是以exon usage为基础的分析,直接对exon计数用DEseq处理,结果直观;
代码参考:http://localhost:17435/notebooks/tmpData/ApcKO_cellranger/psi/R_process.ipynb
1 2 3 4 5 6 7 8 9 10 11 | # BiocManager::install("DEXSeq") python ~ /softwares/self_bin/dexseq_prepare_annotation .py genes.gtf.gz genes.gff samtools view -h WT2_report /outs/gex_possorted_bam .bam "chr5:125017153-125179219" > WT2_Ncor2.sam samtools view -h D21_report /outs/gex_possorted_bam .bam "chr5:125017153-125179219" > D21_Ncor2.sam python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff WT2_Ncor2.sam WT2_Ncor2.count.txt python ~ /softwares/self_bin/dexseq_count .py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2 .0.0 /genes/genes .gff D21_Ncor2.sam D21_Ncor2.count.txt sed 's/\"//g' WT2_Ncor2.count.txt > WT2_Ncor2.count.clean.txt sed 's/\"//g' D21_Ncor2.count.txt > D21_Ncor2.count.clean.txt |
参考:Inferring differential exon usage in RNA-Seq data with the DEXSeq package
http://bioconductor.org/packages/release/bioc/html/DEXSeq.html 【教程已经非常清楚,需要耐心仔细的研读】
跟着tutorial走一遍【生信码农的基本功:工具测试评比】
关键问题:
- gtf文件 - 必须是来自Ensembl的,否则没法正确运行,版本和格式是永恒的浪费时间的点
- 单细胞bam - 必须合并,不然出来的都是0,全部都检测不到
- 运行速度 - Python的wrapper是真的慢
更新:
之前一直模糊的概念,这里必须搞清楚!!!
read count
- 每个exon or exonic part的read count
- 每个样本的read count总和,即library size 测序深度
- 每个gene的read count总和,即该基因的所有transcript的read总数
- relative usage of an exon,即 this / others,该exon的read总数 除以 比对到该gene其他exon的read总数 = 相对exon usage
问题:
- 把count数据转成CPM就能比较exon的CPM吗? - 不能,exon层面就不能简单用gene那一套了,必须先关注这个gene本身是否有差异表达
- relative usage of an exon的数据可比吗? - 这就是校正了gene表达总量后的值,用来比较就更有意义了。
参考:
关于normalization - Introduction to DGE - ARCHIVED
先拆
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | ls archive_20201222 /bam/BAM .sortedByCoord.out/*.bam | sed "s:^:`pwd`/: " > all.bam.csv cat all.bam.csv | grep "10c2" > 10c2.list cat all.bam.csv | grep "17C8" > 17C8.list cat all.bam.csv | grep "1C11" > 1C11.list cat all.bam.csv | grep "20c7" > 20c7.list cat all.bam.csv | grep "23C9" > 23C9.list cat all.bam.csv | grep "3C15" > 3C15.list cat all.bam.csv | grep "5c3" > 5c3.list cat all.bam.csv | grep "6c5" > 6c5.list cat all.bam.csv | grep "/IMR90" > IMR90.list cat all.bam.csv | grep "IMR-N_Diff-D20" > IMR-N_Diff-D20.list cat all.bam.csv | grep "IMR-N_Diff-D40" > IMR-N_Diff-D40.list cat all.bam.csv | grep "iPSC-IMR90" > iPSC-IMR90.list cat all.bam.csv | grep "iPSC-UE" > iPSC-UE.list cat all.bam.csv | grep "/UE" | grep - v "N_Diff" > UE.list cat all.bam.csv | grep "UE-N_Diff_D20" > UE-N_Diff_D20.list cat all.bam.csv | grep "UE-N_Diff-D40" > UE-N_Diff-D40.list split -d -n l /5 10c2.list 10c2.list split -d -n l /5 17C8.list 17C8.list split -d -n l /5 1C11.list 1C11.list split -d -n l /5 20c7.list 20c7.list split -d -n l /5 23C9.list 23C9.list split -d -n l /5 3C15.list 3C15.list split -d -n l /5 5c3.list 5c3.list split -d -n l /5 6c5.list 6c5.list split -d -n l /5 IMR90.list IMR90.list split -d -n l /5 IMR-N_Diff-D20.list IMR-N_Diff-D20.list split -d -n l /5 IMR-N_Diff-D40.list IMR-N_Diff-D40.list split -d -n l /5 iPSC-IMR90.list iPSC-IMR90.list split -d -n l /5 iPSC-UE.list iPSC-UE.list split -d -n l /5 UE.list UE.list split -d -n l /5 UE-N_Diff_D20.list UE-N_Diff_D20.list split -d -n l /5 UE-N_Diff-D40.list UE-N_Diff-D40.list |
合并
1 2 | samtools merge -b UE.bam.list -@ 2 -r -p UE.bam samtools merge -b UE.bam.list -@ 2 -r -p UE.bam |
批量【先合并两个测试一下】
1 2 3 4 5 6 | for i in ` ls {IMR90.list0*,17C8.list0*}` do echo $i samtools merge -b $i -@ 2 -r -p ${i}.bam &&\ echo "done!!!" done |
准备特殊的注释文件 & 计数
gencode版本
1 2 3 4 5 6 | # gencode # ~/databases/hg19/ python ~ /softwares/R_lib_361/DEXSeq/python_scripts/dexseq_prepare_annotation .py gencode.v27.annotation.gtf gencode.v27.annotation.DEXSeq.chr.gff # ~/project/scRNA-seq/rawData/smart-seq/archive_20201222/bam/BAM.sortedByCoord.out/IMR90-E2-A17_STAR_Aligned.sortedByCoord.out.bam python ~ /softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count .py -p yes -f bam ~ /databases/hg19/gencode .v27.annotation.DEXSeq.chr.gff ~ /project/scRNA-seq/rawData/smart-seq/merged .bams /bam/IMR90 .list00.bam tmp.result.txt |
ensemble版本
1 2 3 4 5 6 | ############################################ # ensemble # ~/references/SUPPA2/ref/hg19/ python ~ /softwares/R_lib_361/DEXSeq/python_scripts/dexseq_prepare_annotation .py Homo_sapiens.GRCh37.75.formatted.gtf Homo_sapiens.GRCh37.75.DEXSeq.chr.gff python ~ /softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count .py -f bam ~ /references/SUPPA2/ref/hg19/Homo_sapiens .GRCh37.75.DEXSeq.chr.gff ~ /project/scRNA-seq/rawData/smart-seq/merged .bams /bam/IMR90 .list00.bam tmp.result.txt |
批量运行
1 2 3 4 5 6 | # 批量运行 for i in ` ls ~ /project/scRNA-seq/rawData/smart-seq/merged .bams /bam/ *.bam` do python ~ /softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count .py -f bam ~ /references/SUPPA2/ref/hg19/Homo_sapiens .GRCh37.75.DEXSeq.chr.gff $i ${i}.result.txt &&\ echo $i done |
文件如下:
1 2 3 4 5 6 7 8 9 10 | chr15 dexseq_prepare_annotation.py aggregate_gene 72491370 72524164 . - . gene_id "ENSG00000067225" chr15 dexseq_prepare_annotation.py exonic_part 72491370 72491372 . - . gene_id "ENSG00000067225" ; transcripts "ENST00000564440+ENST00000319622" ; exonic_part_number "001" chr15 dexseq_prepare_annotation.py exonic_part 72491373 72491376 . - . gene_id "ENSG00000067225" ; transcripts "ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000568883+ENST00000564440" ; exonic_part_number "002" chr15 dexseq_prepare_annotation.py exonic_part 72491377 72491445 . - . gene_id "ENSG00000067225" ; transcripts "ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000565143+ENST00000568883+ENST00000564440" ; exonic_part_number "003" chr15 dexseq_prepare_annotation.py exonic_part 72491446 72491930 . - . gene_id "ENSG00000067225" ; transcripts "ENST00000567118+ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000449901+ENST00000565143+ENST00000568883+ENST00000564440" ; exonic_part_number "004" |
很明确,每一个都是pseudo exon【有时候会把exon切得非常碎,需要借助可视化工具来操作】
1 2 3 | # check # https://www.ncbi.nlm.nih.gov/gene/5315 PKM ENSG00000067225 |
总结:
- 对于大型bam文件来说,这个计数实在是太慢了,需要耐心等待
- 操作明确,对于pseudo exon进行read的计数,然后用已有的成熟工具来做差异分析
- 该工具画图居然需要root权限,实在是匪夷所思
可以作为一个AS分析的备选工具。
参考分析目录:
CGS: project/scPipeline/AS/DEXSeq/DEXSeq.ipynb
CGS: project/scRNA-seq/rawData/smart-seq/merged.bams/
待续~
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· winform 绘制太阳,地球,月球 运作规律
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)