可变剪切 | isoform | 提取特定exon的usage | DEXSeq

2023年06月19日

注意基因组的版本，基因的坐标是完全不一样的，区别human和mouse。

# cd ~/tmpData/ApcKO_cellranger/cellranger_report/
# human hg38 chr6:15246069-15522042
# mouse GRCm39 chr13:44882950-45075119
# mouse mm10 chr13:44731271-44921643
samtools view -h WT2_report/outs/gex_possorted_bam.bam "chr13:44731271-44921643" > WT2_Jarid2.sam
samtools view -h D4_report/outs/gex_possorted_bam.bam "chr13:44731271-44921643" > D4_Jarid2.sam
samtools view -h D8_report/outs/gex_possorted_bam.bam "chr13:44731271-44921643" > D8_Jarid2.sam
samtools view -h D21_report/outs/gex_possorted_bam.bam "chr13:44731271-44921643" > D21_Jarid2.sam

python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff WT2_Jarid2.sam WT2_Jarid2.count.txt
python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff D4_Jarid2.sam D4_Jarid2.count.txt
python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff D8_Jarid2.sam D8_Jarid2.count.txt
python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff D21_Jarid2.sam D21_Jarid2.count.txt

sed 's/\"//g' WT2_Jarid2.count.txt > WT2_Jarid2.count.clean.txt
sed 's/\"//g' D4_Jarid2.count.txt > D4_Jarid2.count.clean.txt
sed 's/\"//g' D8_Jarid2.count.txt > D8_Jarid2.count.clean.txt
sed 's/\"//g' D21_Jarid2.count.txt > D21_Jarid2.count.clean.txt

2023年03月16日

RNA Splicing双管齐下：

SUPPA2是以AS event为基础的分析，根据transcript的TPM来计算的；

DEXSeq则是以exon usage为基础的分析，直接对exon计数用DEseq处理，结果直观；

代码参考：http://localhost:17435/notebooks/tmpData/ApcKO_cellranger/psi/R_process.ipynb

# BiocManager::install("DEXSeq")
python ~/softwares/self_bin/dexseq_prepare_annotation.py genes.gtf.gz genes.gff

samtools view -h WT2_report/outs/gex_possorted_bam.bam "chr5:125017153-125179219" > WT2_Ncor2.sam
samtools view -h D21_report/outs/gex_possorted_bam.bam "chr5:125017153-125179219" > D21_Ncor2.sam

python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff WT2_Ncor2.sam WT2_Ncor2.count.txt
python ~/softwares/self_bin/dexseq_count.py /home/zz950/reference/refdata-cellranger-arc-mm10-2020-A-2.0.0/genes/genes.gff D21_Ncor2.sam D21_Ncor2.count.txt

sed 's/\"//g' WT2_Ncor2.count.txt > WT2_Ncor2.count.clean.txt
sed 's/\"//g' D21_Ncor2.count.txt > D21_Ncor2.count.clean.txt

参考：Inferring differential exon usage in RNA-Seq data with the DEXSeq package

http://bioconductor.org/packages/release/bioc/html/DEXSeq.html 【教程已经非常清楚，需要耐心仔细的研读】

跟着tutorial走一遍【生信码农的基本功：工具测试评比】

关键问题：

gtf文件 - 必须是来自Ensembl的，否则没法正确运行，版本和格式是永恒的浪费时间的点
单细胞bam - 必须合并，不然出来的都是0，全部都检测不到
运行速度 - Python的wrapper是真的慢

更新：

之前一直模糊的概念，这里必须搞清楚！！！

read count

每个exon or exonic part的read count
每个样本的read count总和，即library size 测序深度
每个gene的read count总和，即该基因的所有transcript的read总数
relative usage of an exon，即 this / others，该exon的read总数除以比对到该gene其他exon的read总数 = 相对exon usage

问题：

把count数据转成CPM就能比较exon的CPM吗？ - 不能，exon层面就不能简单用gene那一套了，必须先关注这个gene本身是否有差异表达
relative usage of an exon的数据可比吗？ - 这就是校正了gene表达总量后的值，用来比较就更有意义了。

参考：

关于normalization - Introduction to DGE - ARCHIVED

Gene length corrected trimmed mean of M-values (GeTMM) processing of RNA-seq data performs similarly in intersample analyses while improving intrasample comparisons

先拆

ls archive_20201222/bam/BAM.sortedByCoord.out/*.bam | sed "s:^:`pwd`/: "  > all.bam.csv

cat all.bam.csv | grep "10c2" > 10c2.list
cat all.bam.csv | grep "17C8" > 17C8.list
cat all.bam.csv | grep "1C11" > 1C11.list
cat all.bam.csv | grep "20c7" > 20c7.list
cat all.bam.csv | grep "23C9" > 23C9.list
cat all.bam.csv | grep "3C15" > 3C15.list
cat all.bam.csv | grep "5c3" > 5c3.list
cat all.bam.csv | grep "6c5" > 6c5.list
cat all.bam.csv | grep "/IMR90" > IMR90.list
cat all.bam.csv | grep "IMR-N_Diff-D20" > IMR-N_Diff-D20.list
cat all.bam.csv | grep "IMR-N_Diff-D40" > IMR-N_Diff-D40.list
cat all.bam.csv | grep "iPSC-IMR90" > iPSC-IMR90.list
cat all.bam.csv | grep "iPSC-UE" > iPSC-UE.list
cat all.bam.csv | grep "/UE" | grep -v "N_Diff" > UE.list
cat all.bam.csv | grep "UE-N_Diff_D20" > UE-N_Diff_D20.list
cat all.bam.csv | grep "UE-N_Diff-D40" > UE-N_Diff-D40.list

split -d -n l/5 10c2.list 10c2.list
split -d -n l/5 17C8.list 17C8.list
split -d -n l/5 1C11.list 1C11.list
split -d -n l/5 20c7.list 20c7.list
split -d -n l/5 23C9.list 23C9.list
split -d -n l/5 3C15.list 3C15.list
split -d -n l/5 5c3.list 5c3.list
split -d -n l/5 6c5.list 6c5.list
split -d -n l/5 IMR90.list IMR90.list
split -d -n l/5 IMR-N_Diff-D20.list IMR-N_Diff-D20.list
split -d -n l/5 IMR-N_Diff-D40.list IMR-N_Diff-D40.list
split -d -n l/5 iPSC-IMR90.list iPSC-IMR90.list
split -d -n l/5 iPSC-UE.list iPSC-UE.list
split -d -n l/5 UE.list UE.list
split -d -n l/5 UE-N_Diff_D20.list UE-N_Diff_D20.list
split -d -n l/5 UE-N_Diff-D40.list UE-N_Diff-D40.list

合并

samtools merge -b UE.bam.list -@ 2 -r -p UE.bam
samtools merge -b UE.bam.list -@ 2 -r -p UE.bam

批量【先合并两个测试一下】

for i in `ls {IMR90.list0*,17C8.list0*}`
do
echo $i
samtools merge -b $i -@ 2 -r -p ${i}.bam &&\
echo "done!!!"
done

准备特殊的注释文件 & 计数

gencode版本

# gencode
# ~/databases/hg19/
python ~/softwares/R_lib_361/DEXSeq/python_scripts/dexseq_prepare_annotation.py gencode.v27.annotation.gtf gencode.v27.annotation.DEXSeq.chr.gff

# ~/project/scRNA-seq/rawData/smart-seq/archive_20201222/bam/BAM.sortedByCoord.out/IMR90-E2-A17_STAR_Aligned.sortedByCoord.out.bam
python ~/softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count.py -p yes -f bam ~/databases/hg19/gencode.v27.annotation.DEXSeq.chr.gff ~/project/scRNA-seq/rawData/smart-seq/merged.bams/bam/IMR90.list00.bam tmp.result.txt

ensemble版本

############################################
# ensemble
# ~/references/SUPPA2/ref/hg19/
python ~/softwares/R_lib_361/DEXSeq/python_scripts/dexseq_prepare_annotation.py Homo_sapiens.GRCh37.75.formatted.gtf Homo_sapiens.GRCh37.75.DEXSeq.chr.gff

python ~/softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count.py -f bam ~/references/SUPPA2/ref/hg19/Homo_sapiens.GRCh37.75.DEXSeq.chr.gff ~/project/scRNA-seq/rawData/smart-seq/merged.bams/bam/IMR90.list00.bam tmp.result.txt

批量运行

# 批量运行
for i in `ls ~/project/scRNA-seq/rawData/smart-seq/merged.bams/bam/*.bam`
do
python ~/softwares/R_lib_361/DEXSeq/python_scripts/dexseq_count.py -f bam ~/references/SUPPA2/ref/hg19/Homo_sapiens.GRCh37.75.DEXSeq.chr.gff $i ${i}.result.txt &&\
echo $i
done

文件如下：

chr15   dexseq_prepare_annotation.py    aggregate_gene  72491370        72524164        .       -       .
       gene_id "ENSG00000067225"
chr15   dexseq_prepare_annotation.py    exonic_part     72491370        72491372        .       -       .
       gene_id "ENSG00000067225"; transcripts "ENST00000564440+ENST00000319622"; exonic_part_number "001"
chr15   dexseq_prepare_annotation.py    exonic_part     72491373        72491376        .       -       .
       gene_id "ENSG00000067225"; transcripts "ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000568883+ENST00000564440"; exonic_part_number "002"
chr15   dexseq_prepare_annotation.py    exonic_part     72491377        72491445        .       -       .
       gene_id "ENSG00000067225"; transcripts "ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000565143+ENST00000568883+ENST00000564440"; exonic_part_number "003"
chr15   dexseq_prepare_annotation.py    exonic_part     72491446        72491930        .       -       .
       gene_id "ENSG00000067225"; transcripts "ENST00000567118+ENST00000319622+ENST00000565184+ENST00000389093+ENST00000335181+ENST00000449901+ENST00000565143+ENST00000568883+ENST00000564440"; exonic_part_number "004"

很明确，每一个都是pseudo exon【有时候会把exon切得非常碎，需要借助可视化工具来操作】

# check
# https://www.ncbi.nlm.nih.gov/gene/5315
PKM ENSG00000067225

总结：

对于大型bam文件来说，这个计数实在是太慢了，需要耐心等待
操作明确，对于pseudo exon进行read的计数，然后用已有的成熟工具来做差异分析
该工具画图居然需要root权限，实在是匪夷所思

可以作为一个AS分析的备选工具。

参考分析目录：

CGS: project/scPipeline/AS/DEXSeq/DEXSeq.ipynb

CGS: project/scRNA-seq/rawData/smart-seq/merged.bams/

待续~

posted @ 2021-04-29 21:45 Life·Intelligence 阅读(734) 评论(0) 编辑收藏举报

刷新页面返回顶部

Digital-LI

可变剪切 | isoform | 提取特定exon的usage | DEXSeq

2023年06月19日

2023年03月16日