手动计算富集分析
富集分析非常常见,用于判断抽样的结果是否显著。
例子1:一个工厂总共有N件产品,其中M件次品,现在从中抽取n件做检查,抽到k件次品的概率分布服从超几何分布。
例子2:一个细胞有N个基因,其中在pathway A里面有M个基因,现在从中抽取n个基因,抽到k个pathway A里基因的概率分布服从超几何分布。
最靠谱的富集分析当属clusterProfiler,里面的enrichGO可以做富集分析。
现在我有个性化的需求,所以要自己做,想借鉴一下里面的代码。
查看enrichGO代码,发现是里面的enricher_internal在做这件事,去GitHub查,发现enricher_internal是DOSE的函数,继续去GitHub查,定位到enricher_internal。
代码一览无余:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | termID2ExtID <- termID2ExtID[idx] qTermID2ExtID <- qTermID2ExtID[idx] qTermID <- unique ( names (qTermID2ExtID)) ## prepare parameter for hypergeometric test k <- sapply (qTermID2ExtID, length) k <- k[qTermID] M <- sapply (termID2ExtID, length) M <- M[qTermID] N <- rep ( length (extID), length (M)) ## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop. n <- rep ( length (qExtID2TermID), length (M)) args.df <- data.frame (numWdrawn=k-1, ## White balls drawn numW=M, ## White balls numB=N-M, ## Black balls numDrawn=n) ## balls drawn ## calcute pvalues based on hypergeometric model pvalues <- apply (args.df, 1, function (n) phyper (n[1], n[2], n[3], n[4], lower.tail= FALSE ) ) ## gene ratio and background ratio GeneRatio <- apply ( data.frame (a=k, b=n), 1, function (x) paste (x[1], "/" , x[2], sep= "" , collapse= "" ) ) BgRatio <- apply ( data.frame (a=M, b=N), 1, function (x) paste (x[1], "/" , x[2], sep= "" , collapse= "" ) ) Over <- data.frame (ID = as.character (qTermID), GeneRatio = GeneRatio, BgRatio = BgRatio, pvalue = pvalues, stringsAsFactors = FALSE ) p.adj <- p.adjust (Over$pvalue, method=pAdjustMethod) qobj <- tryCatch ( qvalue (p=Over$pvalue, lambda=0.05, pi0.method= "bootstrap" ), error= function (e) NULL ) if ( class (qobj) == "qvalue" ) { qvalues <- qobj$qvalues } else { qvalues <- NA } geneID <- sapply (qTermID2ExtID, function (i) paste (i, collapse= "/" )) geneID <- geneID[qTermID] Over <- data.frame (Over, p.adjust = p.adj, qvalue = qvalues, geneID = geneID, Count = k, stringsAsFactors = FALSE ) Description <- TERM2NAME (qTermID, USER_DATA) if ( length (qTermID) != length (Description)) { idx <- qTermID % in % names (Description) Over <- Over[idx,] } Over$Description <- Description nc <- ncol (Over) Over <- Over[, c (1,nc, 2:(nc-1))] Over <- Over[ order (pvalues),] Over$ID <- as.character (Over$ID) Over$Description <- as.character (Over$Description) row.names (Over) <- as.character (Over$ID) x <- new ( "enrichResult" , result = Over, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, qvalueCutoff = qvalueCutoff, gene = as.character (gene), universe = extID, geneSets = geneSets, organism = "UNKNOWN" , keytype = "UNKNOWN" , ontology = "UNKNOWN" , readable = FALSE ) return (x) |
核心就是这个代码了:
1 2 3 4 5 6 7 8 9 10 | args.df <- data.frame (numWdrawn=k-1, ## White balls drawn numW=M, ## White balls numB=N-M, ## Black balls numDrawn=n) ## balls drawn ## calcute pvalues based on hypergeometric model pvalues <- apply (args.df, 1, function (n) phyper (n[1], n[2], n[3], n[4], lower.tail= FALSE ) ) |
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
其中:
第一个参数:q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.
第二个参数:m
the number of white balls in the urn.
第三个参数:n
the number of black balls in the urn.
第四个参数:k
the number of balls drawn from the urn.
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25
关键的只有两个数:
GeneRatio:我输入的基因数854(n),其中在pathway A里的有68个(k)
BgRatio:总共背景(有注释)基因数23210(N),其中pathway A里的基因数364个(M)
phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)
带入数字:
phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29
结果一致。
进阶:
- lower.tail是什么
- 为什么第一个参数要减1
参考:
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· winform 绘制太阳,地球,月球 运作规律
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)