limma,edgeR,DESeq2三大包基本是做转录组差异分析的金标准,大多数转录组的文章都是用这三个R包进行差异分析。
edgeR差异分析速度快,得到的基因数目比较多,假阳性高(实际不差异结果差异)。
DESeq2差异分析速度慢,得到的基因数目比较少,假阴性高(实际差异结果不差异)。
需要注意的是制作分组信息的因子向量时,因子水平的前后顺序,在R的很多模型中,默认将因子向量的第一个水平看作对照组。
logFC值:
limma使用对数化的表达矩阵作为输入,所以将零假设指定在平均log值(对数几何平均数)。
edgeR和DESeq2使用原始的count矩阵作为输入,所以将原假设指定在count的平均值上(算术平均数)。
logFC一般是在1.2到2之间,筛选到差异基因的数目在500-1000左右为宜•
根据logFC的统计指标确定阈值的方法是计算logFC绝对值的平均数与2倍标准差之和。(正态分布)
DEG表示差异表达基因,differential gene express
0. 准备
判断式安装R包:如果该R包存在,可以顺带加载该R包,不需要再次library。
1 2 3 4 5 6 7 8 9 10 | if (! require (stringr)) install.packages ( "stringr" ) # 常规R包 if (! require (ggplotify)) install.packages ( "ggplotify" ) # 热图转换为ggplot的类型 if (! require (patchwork)) install.packages ( "patchwork" ) # 拼图R包 if (! require (cowplot)) install.packages ( "cowplot" ) # 拼图R包 if (! require (DESeq2))BiocManager:: install ( "DESeq2" ) # 差异分析R包 if (! require (edgeR))BiocManager:: install ( "edgeR" ) # 差异分析R包 if (! require (limma))BiocManager:: install ( "limma" ) # 差异分析R包 rm (list = ls ()) load ( "TCGA-CHOLgdc.Rdata" ) table (group_list) |
差异分析输入数据:表达矩阵exp,分组信息group_list。
1. DESeq2
差异分析
1 2 3 4 5 6 7 8 9 | library (DESeq2) library (stringr) colData <- data.frame (row.names = colnames (exp), condition = group_list) # 列出每个样品是肿瘤样品还是正常样品 dds <- DESeqDataSetFromMatrix (countData = exp, colData = colData, design = ~ condition) %>% DESeq () # 将数据框转为DESeq2的数据集类型,然后用DESeq函数做差异分析 |
提取结果,两两比较
1 2 3 4 5 | res <- results (dds, contrast = c ( "condition" , rev ( levels (group_list)))) # 按设置的比较水平提取dds的数据(从DESeq分析中提取结果表,给出样本的基本均值,logFC及其标准误,检验统计量,p值和矫正后的p值)。rev表示调换顺序,levels(group_list)是因子水平 DEG <- res[ order (res$pvalue),] %>% as.data.frame () # 将res按照P值从小到大排序,并转换回数据框。order函数获取向量每个元素从小到大排序后的第几位 head (DEG) |
添加change列标记基因上调下调
1 2 3 4 5 6 7 | logFC_cutoff <- with (DEG, mean ( abs (log2FoldChange)) + 2* sd ( abs (log2FoldChange))) # 设置logFC的阈值,可以计算得出,也可以设置固定值。abs函数计算log2FoldChange的绝对值。均数+2倍标准差可以包含log2FoldChange的95%置信区间 k1 = (DEG$pvalue < 0.05)&(DEG$log2FoldChange < -logFC_cutoff) # 判断下调基因 k2 = (DEG$pvalue < 0.05)&(DEG$log2FoldChange > logFC_cutoff) # 判断上调基因 DEG$change = ifelse (k1, "DOWN" , ifelse (k2, "UP" , "NOT" )) # 标记上、下调基因 table (DEG$change) # 查看上下调基因的数量 head (DEG) DESeq2_DEG <- DEG # 保存DESeq2差异分析的结果 |
2. edgeR
差异分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | library (edgeR) library (stringr) # 准备count表达矩阵和分组信息,构建edgeR的DGEList对象 dge <- DGEList (counts = exp, group = group_list) # 过滤low count数据,保留在至少在两个样本中cpm值(Counts per million)大约1的基因。 # keep <- rowSums(cpm(dge) > 1 ) >= 2 # table(keep) # dge <- dge[keep, , keep.lib.sizes = FALSE] # 由于在上一步整理数据的过程中已经过滤了一遍数据了,所以这里不再过滤。 dge$samples$lib.size <- colSums (dge$counts) # 标准化基因表达分布,以TMM标准化为例。 dge <- calcNormFactors (dge) # calcNormFactors函数并不会标准化数据,只是计算标准化因子 # 假设数据符合正态分布,构建线性模型。 design <- model.matrix ( ~ 0 + group_list) # 0代表x线性模型的截距为0,需要修改 rownames (design) <- colnames (dge) colnames (design) <- levels (group_list) # 计算线性模型的参数 dge <- estimateGLMCommonDisp (dge,design) %>% estimateGLMTrendedDisp (design) %>% estimateGLMTagwiseDisp (design) fit <- glmFit (dge, design) %>% # 拟合线性模型 glmLRT (contrast = c (-1,1)) # 进行差异分析。(1,-1)意味着前比后 |
提取结果
1 2 3 4 5 6 7 8 9 10 11 12 | DEG = topTags (fit, n= nrow (exp)) %>% as.data.frame () # 提取差异分析结果 # 添加change列标记基因上调下调 logFC_cutoff <- with (DEG, mean ( abs (logFC)) + 2* sd ( abs (logFC))) # 设置logFC的阈值 # logFC_cutoff <- 2 k1 = (DEG$PValue < 0.05)&(DEG$logFC < -logFC_cutoff) # 判断下调基因 k2 = (DEG$PValue < 0.05)&(DEG$logFC > logFC_cutoff) # 判断上调基因 DEG$change = ifelse (k1, "DOWN" , ifelse (k2, "UP" , "NOT" )) # 标记上、下调基因 head (DEG) table (DEG$change) edgeR_DEG <- DEG |
3. limma
差异分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | library (limma) library (stringr) # 创建设计矩阵和对比。假设数据符合正态分布,构建线性模型 design <- model.matrix ( ~ 0 + group_list) colnames (design) = levels (group_list) rownames (design) = colnames (exp) dge <- DGEList (counts=exp) %>% # 将count表达矩阵转换为edgeR的DGEList对象 calcNormFactors () # 进行标准化和表达量计算。calcNormFactors函数并不会标准化数据,只是计算标准化因子。 # 标准化、线性模型拟合 fit <- voom (dge, design, normalize = "quantile" ) %>% # voom是limma中的一种标准化方法 lmFit (design) # 线性模型拟合,出图 # 设置需要进行对比的分组 comp = paste ( rev ( levels (group_list)), collapse = "-" ) cont.matrix <- makeContrasts (contrasts = comp, levels = design) # 比较每个基因 fit2 <- contrasts.fit (fit, cont.matrix) %>% # 根据对比模型进行差值计算 eBayes () # 贝叶斯检验 # 使用plotSA绘制了log2残差标准差与log-CPM均值的关系。平均log2残差标准差由水平蓝线标出 plotSA (fit2, main = "Final model: Mean-variance trend" ) |
提取结果
1 2 3 4 5 6 7 8 9 10 11 12 | DEG = topTable (fit2, coef = comp, n= Inf ) %>% # 提取差异分析结果 na.omit () # 去除差异分析结果中包含NA值的行 length ( which (DEG$adj.P.Val < 0.05)) # p值<0.05的基因有多少个 # 添加change列标记基因上调下调 logFC_cutoff <- with (DEG, mean ( abs (logFC)) + 2* sd ( abs (logFC))) # 设置logFC的阈值 k1 = (DEG$P.Value < 0.05)&(DEG$logFC < -logFC_cutoff) # 判断下调基因 k2 = (DEG$P.Value < 0.05)&(DEG$logFC > logFC_cutoff) # 判断上调基因 DEG$change = ifelse (k1, "DOWN" , ifelse (k2, "UP" , "NOT" )) # 标记上、下调基因 table (DEG$change) head (DEG) limma_voom_DEG <- DEG |
4. 数据检查
整合三大R包差异分析的数据:
1 2 3 4 5 | tj = data.frame (deseq2 = as.integer ( table (DESeq2_DEG$change)), edgeR = as.integer ( table (edgeR_DEG$change)), limma_voom = as.integer ( table (limma_voom_DEG$change)), row.names = c ( "down" , "not" , "up" ) );tj |
验证数据:
验证logFC是否由tumor/normal计算而来:
1 2 3 | boxplot ( as.integer (exp[ rownames (DESeq2_DEG)[1],]) ~ group_list) # 取exp的第一行基因表达量,以group_list为分组依据,做箱线图。修改“DESeq2_DEG”即可分别分析三种方法。 DESeq2_DEG$log2FoldChange[1] # 得到DESeq2_DEG的第一个logFC值。如果箱线图中normal大tumor小,则logFC应为负值,反之为正值。 # 如果检查发现logFC的正负值反了,则修改为负值即可:DESeq2_DEG$log2FoldChange = -DESeq2_DEG$log2FoldChange |
保存数据
1 | save (DESeq2_DEG, edgeR_DEG, limma_voom_DEG, group_list, tj, file = paste0 (cancer_type, "DEG.Rdata" )) |
5.三大R包差异分析可视化(热图、火山图)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | rm (list = ls ()) setwd ( "D:/WorkingDirectory/01.mRNA_Expdata" ) load ( "TCGA-CHOL_01.mRNA_Expdata_count.Rdata" ) # 加载表达数据整理结果 setwd ( "D:/WorkingDirectory/03.mRNA_Diff" ) load ( "TCGA-CHOL_DEG.Rdata" ) # 加载差异表达基因结果 if (! require (tinyarray))devtools:: install_local ( "tinyarray-master.zip" , upgrade = F) library (ggplot2) library (tinyarray) exp[1:4,1:4] # 原始count表达矩阵数据 dat = log2 (exp+1) pca.plot = draw_pca (dat, group_list); pca.plot # 作PCA图 save (pca.plot, file = paste0 (cancer_type, "_PCA.Plot.Rdata" )) # 保存PCA图数据以便后续拼图 # 提取差异基因 cg1 = rownames (DESeq2_DEG)[DESeq2_DEG$change != "NOT" ] cg2 = rownames (edgeR_DEG)[edgeR_DEG$change != "NOT" ] cg3 = rownames (limma_voom_DEG)[limma_voom_DEG$change != "NOT" ] # 画热图 h1 = draw_heatmap (dat[cg1,], group_list, scale_before = T) h2 = draw_heatmap (dat[cg2,], group_list, scale_before = T) h3 = draw_heatmap (dat[cg3,], group_list, scale_before = T) # 构建“均数+2倍标准差”函数 m2d = function (x){ mean ( abs (x))+2* sd ( abs (x)) } # 画火山图 v1 = draw_volcano (DESeq2_DEG, pkg = 1, logFC_cutoff = m2d (DESeq2_DEG$log2FoldChange)) v2 = draw_volcano (edgeR_DEG, pkg = 2, logFC_cutoff = m2d (edgeR_DEG$logFC)) v3 = draw_volcano (limma_voom_DEG, pkg = 3, logFC_cutoff = m2d (limma_voom_DEG$logFC)) # 拼图 library (patchwork) (h1 + h2 + h3) / (v1 + v2 + v3) + plot_layout (guides = "collect" ) # & theme(legend.position = "none") # +表示横向,/表示纵向,guides表示将图例统一放在右侧。 # &theme(legend.position = "none")表示不显示图例。 ggsave ( paste0 (cancer_type, "_Heat.Vo.png" ),width = 15, height = 10) |
6.三大R包差异基因对比(热图、韦恩图、PCA图)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | rm (list = ls ()) setwd ( "D:/WorkingDirectory/01.mRNA_Expdata" ) load ( "TCGA-CHOL_01.mRNA_Expdata_count.Rdata" ) setwd ( "D:/WorkingDirectory/03.mRNA_Diff" ) load ( "TCGA-CHOL_DEG.Rdata" ) load ( "TCGA-CHOL_PCA.Plot.Rdata" ) if (! require (tinyarray))devtools:: install_local ( "tinyarray-master.zip" , upgrade = F) library (ggplot2) library (tinyarray) # 构建提取上调基因的函数: UP = function (df){ rownames (df)[df$change == "UP" ] } # 构建提取下调基因的函数: DOWN = function (df){ rownames (df)[df$change == "DOWN" ] } # 取三大R包上调基因交集: up = intersect ( intersect ( UP (DESeq2_DEG), UP (edgeR_DEG)), UP (limma_voom_DEG)) # intersect函数表示取两个向量的交集。 # 取三大R包下调基因交集: down = intersect ( intersect ( DOWN (DESeq2_DEG), DOWN (edgeR_DEG)), DOWN (limma_voom_DEG)) # 取三大R包差异基因交集画热图: dat = log2 (exp+1) hp = draw_heatmap (dat[ c (up, down), ], group_list, scale_before = T) #上调、下调基因分别画维恩图: up_genes = list (Deseq2 = UP (DESeq2_DEG), edgeR = UP (edgeR_DEG), limma = UP (limma_voom_DEG)) down_genes = list (Deseq2 = DOWN (DESeq2_DEG), edgeR = DOWN (edgeR_DEG), limma = DOWN (limma_voom_DEG)) up.plot <- draw_venn (up_genes, "UPgene" ) down.plot <- draw_venn (down_genes, "DOWNgene" ) #维恩图拼图 library (patchwork) pca.plot + hp + up.plot + down.plot + plot_layout (guides = "collect" ) ggsave ( paste0 (cancer_type, "_Heat.Ve.PCA.png" ), width = 15, height = 10) |
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· 物流快递公司核心技术能力-地址解析分单基础技术分享
· .NET 10首个预览版发布:重大改进与新特性概览!