手动计算富集分析

富集分析非常常见，用于判断抽样的结果是否显著。

例子1：一个工厂总共有N件产品，其中M件次品，现在从中抽取n件做检查，抽到k件次品的概率分布服从超几何分布。

例子2：一个细胞有N个基因，其中在pathway A里面有M个基因，现在从中抽取n个基因，抽到k个pathway A里基因的概率分布服从超几何分布。

最靠谱的富集分析当属clusterProfiler，里面的enrichGO可以做富集分析。

现在我有个性化的需求，所以要自己做，想借鉴一下里面的代码。

查看enrichGO代码，发现是里面的enricher_internal在做这件事，去GitHub查，发现enricher_internal是DOSE的函数，继续去GitHub查，定位到enricher_internal。

代码一览无余：

termID2ExtID <- termID2ExtID[idx]
qTermID2ExtID <- qTermID2ExtID[idx]
qTermID <- unique(names(qTermID2ExtID))
 
## prepare parameter for hypergeometric test
k <- sapply(qTermID2ExtID, length)
k <- k[qTermID]
M <- sapply(termID2ExtID, length)
M <- M[qTermID]
 
N <- rep(length(extID), length(M))
## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
n <- rep(length(qExtID2TermID), length(M))
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                      numW=M,        ## White balls
                      numB=N-M,      ## Black balls
                      numDrawn=n)    ## balls drawn
 
 
## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
                 phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                 )
 
## gene ratio and background ratio
GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
                   paste(x[1], "/", x[2], sep="", collapse="")
                   )
BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
                 paste(x[1], "/", x[2], sep="", collapse="")
                 )
 
 
Over <- data.frame(ID = as.character(qTermID),
                   GeneRatio = GeneRatio,
                   BgRatio = BgRatio,
                   pvalue = pvalues,
                   stringsAsFactors = FALSE)
 
p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)
 
if (class(qobj) == "qvalue") {
    qvalues <- qobj$qvalues
} else {
    qvalues <- NA
}
 
geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
geneID <- geneID[qTermID]
Over <- data.frame(Over,
                   p.adjust = p.adj,
                   qvalue = qvalues,
                   geneID = geneID,
                   Count = k,
                   stringsAsFactors = FALSE)
 
Description <- TERM2NAME(qTermID, USER_DATA)
 
if (length(qTermID) != length(Description)) {
    idx <- qTermID %in% names(Description)
    Over <- Over[idx,]
}
Over$Description <- Description
nc <- ncol(Over)
Over <- Over[, c(1,nc, 2:(nc-1))]
 
 
Over <- Over[order(pvalues),]
 
 
Over$ID <- as.character(Over$ID)
Over$Description <- as.character(Over$Description)
 
row.names(Over) <- as.character(Over$ID)
 
x <- new("enrichResult",
         result         = Over,
         pvalueCutoff   = pvalueCutoff,
         pAdjustMethod  = pAdjustMethod,
         qvalueCutoff   = qvalueCutoff,
         gene           = as.character(gene),
         universe       = extID,
         geneSets       = geneSets,
         organism       = "UNKNOWN",
         keytype        = "UNKNOWN",
         ontology       = "UNKNOWN",
         readable       = FALSE
         )
return (x)

核心就是这个代码了：

args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                  numW=M,        ## White balls
                  numB=N-M,      ## Black balls
                  numDrawn=n)    ## balls drawn
 
 
## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
             phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
             )

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

其中：

第一个参数：q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

第二个参数：m
the number of white balls in the urn.

第三个参数：n
the number of black balls in the urn.

第四个参数：k
the number of balls drawn from the urn.

ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25

关键的只有两个数：

GeneRatio：我输入的基因数854（n），其中在pathway A里的有68个（k）

BgRatio：总共背景（有注释）基因数23210（N），其中pathway A里的基因数364个（M）

phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)

带入数字：

phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29

结果一致。

进阶：

lower.tail是什么
为什么第一个参数要减1

参考：

超几何分布检验（hypergeometric test）

https://github.com/YuLab-SMU/DOSE/blob/5be8e21c56242d58b5576c20412b3457bc61dae7/R/enricher_internal.R

不靠谱的富集软件太多了！

posted @ 2020-03-17 14:14 Life·Intelligence 阅读(1805) 评论(0) 编辑收藏举报

刷新页面返回顶部

（评论功能已被禁用）

2025年3月

日

一

二

三

四

五

六

Digital-LI

手动计算富集分析

搜索

我的标签

积分与排名

阅读排行榜