使用cistrome BETA整合ChIPseq和RNAseq
代码参考:/home/zz950/projects/cut_run/HDAC-b2/pipeline/GSEA_BETA_integration
写在前面:
在获得同一个样本多种测序数据后,一个自然的目标就是整合,general的问题就是:表观是如何影响转录的?
基本的数据种类:
- TF binding,ChIP-seq和Cut&Run
- Histone profile,ChIP-seq和Cut&Run
- Open chromatin,ATAC-seq
- Gene expression,RNA-seq
具体的问题就是:
- 表观转录调控是如何进行的?
- 能否用表观数据来准确预测GEX?需要大样本或者单细胞!
- 能否搞清楚DEG是如何被该TF调控的?涉及到哪些pathway?有哪些co-factor?
BETA功能:
BETA has three functions:
- (i) to predict whether the factor has activating or repressive function
- (ii) to infer the factor’s target genes
- (iii) to identify the motif of the factor and its collaborators, which might modulate the factor’s activating or repressive function.
用于预测转录因子具有激活还是抑制的功能;
推断识别转录因子的直接靶基因;
用于鉴定转录因子的motif及其结合者。
BETA-basic can be used to predict whether a factor has activating or repressive function and detect direct target genes.【这个手动就能做出来,就是个peak注释,然后用DEG来分】
BETA-plus can be used to predict whether a factor has activating or repressive function, whether it can detect direct target genes and whether it can analyze sequence motifs in target regions.【这个的统计分析有一点高端】
Both binding and differential expression data are required for BETA-basic and BETA-plus, whereas BETA-minus is used when only binding data are available to predict target genes.【这个就不用说了,chipseeker,chip-anno, great都可以坐】
优点:
- BETA能给你把一切都rank好,而且给出P-value,你自己做则有一点麻烦
缺点:
- BETA基本不涉及到算法,也没什么fancy的地方
BETA输入:
- 文件一:peak文件,bed格式,我觉得这个必须用DAP的peak
- 文件二:DEG文件,需要最基本的gene name,log2fc,P-value
代码:
安装
conda create -y -n beta_chip python=2.7.15 conda activate beta_chip # conda install -y -c hcc beta # conda install -y libiconv pip install argparse pip install numpy # download from http://cistrome.org/BETA/src/BETA_1.0.7.zip python setup.py install
BETA basic \ -p 3656_peaks.bed \ -e AR_diff_expr.xls \ -k LIM \ -g hg19 \ --da 500 \ -n AR \ -o basic_output_dir
BETA plus \ -p 3656_peaks.bed \ -e AR_diff_expr.xls \ -k LIM \ -g hg19 \ --gs /home/zz950/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa \ -n AR \ -o plus_output_dir
自定义的三列DEG格式
BETA basic \ -p 3656_peaks.bed \ -e AR_diff_expr.3c.xls \ -k BSF \ --info 1,2,3 \ -g hg19 \ --da 500 \ -n AR \ -o basic_output_dir
gene symbol转RefSeq ID
gene.ID.df <- read.csv("https://github.com/leezx/RToolbox/raw/master/data/HGNC-RefSeq-Ensembl.txt",header = T,sep = "\t") gene.ID.df <- gene.ID.df[!duplicated(gene.ID.df$Approved.symbol),] gene.ID.df <- subset(gene.ID.df, Approved.symbol!="" & RefSeq.IDs!="") rownames(gene.ID.df) <- gene.ID.df$Approved.symbol tmp.DEG$RefSeq <- gene.ID.df[rownames(tmp.DEG),]$RefSeq.IDs tmp.DEG <- subset(tmp.DEG, !is.na(RefSeq)) tmp.DEG <- tmp.DEG[,c("RefSeq","Log2FC","Pvalue")] write.table(tmp.DEG, file = "keyRdata/RefSeq.HT115_M60-high.3C.txt", quote = F, row.names = F, col.names = F, sep = "\t")
测试数据:http://cistrome.org/BETA/src/BETA_test_data.zip
# https://genome-euro.ucsc.edu/ wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr?.fa.gz' wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr??.fa.gz' zcat chr* > hg38.ucsc.fa
如何产生有意义的结果?
- 用DAP和DEG,用diff来预测diff;
- DAP分gain和loss两类,分别来看;
- 其他peak文件也可以用BETA来分析,注意分gain和loss;
BETA预测出来的是direct target genes,不是间接的,理论上RNA-seq的DEG都算是HDAC1/2i的靶基因。
核心结果解读:
The red and the purple lines represent the upregulated and downregulated genes, respectively. The dashed line indicates the nondifferentially expressed genes as background. Genes are cumulated by the rank on the basis of the regulatory potential score from high to low. P values that represent the significance of the UP or DOWN group distributions are compared with the NON group by the Kolmogorov-Smirnov test.
原理:
看原始文献
用peak来predict DEG
monotonically decreasing function that is based on the distance between the binding site and transcription start site - 这不太行,一个distal enhancer可以10x GEX表达量,递减不合适
核心逻辑:gene expression changes associated with factor binding can give better confidence that a gene is a direct target
先看激活或抑制功能的预测。
首先,把RNA-seq获得的差异基因按照表达模式分成了上调,下调和不变三个组。又把ChIP-seq对应的基因按照regulatory potential score进行从高到低的排序。然后分别做上调,下调和不变基因的累计分布曲线。
什么是regulatory potential score?
这个得分有一个具体的公式,就是Sg。每个基因都有一个调控潜能得分,在默认的100kb范围内,所有靠近基因转录起始位点(g)的结合位点(k)都会被考虑。k表示的是结合位点的数量,也就是peak的数量。∆是结合位点与TSS(转录起始位点)之间的精确距离,与100 kb成比例(∆ = 0.1表示精确距离= 10 kb)。4表示的是随着距离增加,调控力衰减的快慢程度的一个常数。从这个公式里我们可以看到,当结合位点和TSS距离最小是0 的时候,调控潜能最大,是e-0.5。随着距离增加,调控潜能变小。
怎么通过这个图,累积分布曲线看出是激活或者抑制功能。这是前列腺癌细胞系LNCaP里AR的ChIP-seq做的图。这个图的三个线,红线蓝线和虚线分别表示RNA-seq得到的上调,下调和不变的基因,无差异表达的基因作为背景。横坐标是对调控潜能从高到低进行的排序,是一个rank。Y轴是基因的累积分布比例。整个曲线是一个CDF。假设10000这里,蓝色曲线对应的蓝线是30%左右,而红色对应的是70%左右,也就是说70%的上调基因都具有较高的调控潜能,从整个曲线可以看出来,红色曲线同一个X轴对应的Y值都比较大。所以说明AR是起到激活作用。而这两个图,Tet1在小鼠ES细胞中具有抑制功能(图3a),ESR1在乳腺癌细胞株MCF-7中具有激活和抑制功能。再来看一个例子,这个表格,假设有10个基因,3个上调基因,3个下调基因还有4个表达量不变的基因,对每个基因进行计算,发现上调基因的调控潜势普遍排在最前面,这样就得出这个转录因子具有激活的作用。
这个图是直接靶基因预测的结果,最后一列是靶基因名称。每个基因都有两个rank,一个是基于调控潜能Rgb,一个是基于差异表达Rge,然后计算这两个的乘积。假设有n个基因表达都有差异,而且调控潜能都大于0,就是说至少有一个结合事件,每个基因有两个rank,一个是递减的调控潜能Rgb,一个是增加的FDR或者P值,Rge,然后计算乘积。RP可以看成是一个P值。可以通过具体的cutoff来判断靶标,或者根据具体排序来判断靶标。
第三个功能是转录因子和协作子的motif预测
这两个图展示的是AR上调靶标区域的结合motif分析的截图和ESR1下调靶标区域的结合的motif分析。因为前面已经预测过AR是具有激活功能,所以这边关注的是上调结合的motif。ESR1也是一样的。最后一列提供了组中最显著的因子的motif logo。motif名称、dna结合域和物种显示在前三列。
这个是原理图,先找到结合位点的summit,然后在summit上下扩宽100bp,作为middle200bp目标区域,然后再往上下游各拓宽200bp,作为背景,以此来看在目标区域显著富集的motif。
能否拓展到其他peak数据?
- Histone profile
- Open chromatin
局限:
仍然是在找association,因为引入了更可靠的假设,所以结果将会更加可靠。
- First, factor-binding sites and target genes usually lack a one-to-one relationship. The same factor could bind anywhere between the proximal promoter to hundreds of kilobases downstream to regulate gene expression.
- Alternatively, the same binding site could regulate multiple genes by interacting with different promoters in different subpopulations of cells.
- Second, not all factor-binding sites found in a ChIP-seq experiment are functional, potentially owing to the lack of collaborating factors or conditions favorable to their function.
- Finally, the binding of one factor may cause secondary effects owing to transcriptional changes of its direct targets.
其他工具:
- GREAT target analysis tool - peak注释
参考:
- 文章:Target analysis by integration of transcriptome and ChIP-seq data with BETA - Nat Protoc - 2014
- 教程:BETA Input Files Format Description
- BETA整合ChIPseq和RNAseq
- 神技能,让你手里的ChIP-seq和RNA-seq数据1+1>2
- 大牛新宠,转录调控研究神器!找关键转录因子、推测转录因子-靶基因关系
- 案例文章:Transcriptional landscape of the human cell cycle - PNAS