Advanced single-cell data analysis: removal of cell cycle effect
The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primarily sequencing depth) on downstream analyses. However, heterogeneity in cell cycle stage, particularly among mitotic cells transitioning between S and G2/M phases, can also drive substantial transcriptomic variation that masks biological signal. To mitigate this effect, we use a two-step approach:
1) quantify cell cycle stage for each cell using supervised analyses with known stage-specific markers,
2) regress out the effect of cell cycle stage using the same negative binomial regression as outlined above.
For the first step, we use a previously published list of cell cycle-dependent genes (43 S phase genes, 54 G2/M phase genes) for an enrichment analysis similar to that proposed in ref. 11.
For each cell, we compare the sum of phase-specific gene expression (log10-transformed UMIs) to the distribution of 100 random background gene sets, where the number of background genes is identical to the size of the phase gene set, and the background genes are drawn from the same expression bins. Expression bins are defined by 50 non-overlapping windows of the same range based on log10(mean UMI). The phase-specific enrichment score is the expression z-score relative to the mean and standard deviation of the background gene sets. Our final ‘cell cycle score’ (Extended Data Fig. 1) is the difference between the S-phase score and the G2/M-phase score.
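To make the binning step concrete, here is a minimal sketch in R of assigning genes to 50 equal-width windows of log10(mean UMI). The toy `gene.mean` vector and its gene names are made up for illustration; only the idea of 50 equal-range windows comes from the text.

```r
set.seed(1)

# Toy data (hypothetical): mean UMI per gene for 1000 genes.
gene.mean <- setNames(10^rnorm(1000, mean = -1, sd = 1),
                      paste0('gene', 1:1000))

# 50 equal-width windows over the range of log10(mean UMI).
# Passing a single number as `breaks` makes cut() split the range
# into that many intervals of equal length.
gene.bin <- cut(log10(gene.mean), breaks = 50, labels = FALSE)
names(gene.bin) <- names(gene.mean)

# Equal-width bins can hold very different numbers of genes.
bin.sizes <- table(factor(gene.bin, levels = 1:50))
```

Note that the `get.cc.score` listing in this post derives its breaks from quantiles of `log10(gene.mean)` instead, which yields bins with roughly equal gene counts rather than equal width.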
For a final normalized dataset with the cell cycle effect removed, we perform negative binomial regression with technical factors and cell cycle score as predictors. Although cell cycle activity is regressed out of the data for downstream analysis, we store the cell cycle score computed before regression, so the mitotic phase of each individual cell remains available. Notably, our regression strategy is tailored to mitigate the effect of transcriptional heterogeneity within mitotic cells in different phases, and should not affect global differences between mitotic and non-mitotic cells that may be biologically relevant.
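The regression step can be sketched for a single gene as follows. This is an illustration on simulated data, not the authors' code: the normalization procedure itself is not shown in this post, so the use of `MASS::glm.nb`, the choice of `log10(depth)` as the technical factor, and the use of Pearson residuals as the normalized values are all assumptions, and every variable name is hypothetical.

```r
library(MASS)  # provides glm.nb()

set.seed(1)
n.cells <- 500

# Hypothetical per-cell covariates: sequencing depth and cell cycle score.
depth    <- round(10^rnorm(n.cells, mean = 4, sd = 0.2))
cc.score <- rnorm(n.cells)

# Simulated UMI counts for one gene that depends on both covariates.
mu <- exp(-6 + log(depth) + 0.5 * cc.score)
y  <- rnbinom(n.cells, mu = mu, size = 5)

# Negative binomial regression with a technical factor and the
# cell cycle score as predictors.
fit <- glm.nb(y ~ log10(depth) + cc.score)

# Assumption: Pearson residuals are taken as the normalized expression
# with both effects regressed out.
y.norm <- residuals(fit, type = 'pearson')

# Association with the cell cycle score before and after regression.
cor.before <- cor(log10(y + 1), cc.score)
cor.after  <- cor(y.norm, cc.score)
```

In this toy setting the residuals should show much weaker association with `cc.score` than the log-transformed counts do, which is the intended effect of including the score as a predictor.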
```r
# Compute per-cell S and G2/M enrichment scores and a combined cell cycle score.
# cm:   gene x cell matrix of UMI counts (row names are gene symbols)
# N:    number of random background gene sets
# seed: random seed for reproducible background sampling
# Note: get.bg.lists() and enr.score() are helpers not shown in this listing.
get.cc.score <- function(cm, N = 100, seed = 42) {
  set.seed(seed)
  cat('get.cc.score, ')
  cat('number of random background gene sets set to', N, '\n')

  # keep only genes detected in at least min.cells cells
  min.cells <- 5

  cells.mols <- apply(cm, 2, sum)  # total UMIs per cell (not used below)
  gene.cells <- apply(cm > 0, 1, sum)
  cm <- cm[gene.cells >= min.cells, ]

  gene.mean <- apply(cm, 1, mean)

  # assign each gene to an expression bin based on log10(mean UMI)
  breaks <- unique(quantile(log10(gene.mean), probs = seq(0, 1, length.out = 50)))
  gene.bin <- cut(log10(gene.mean), breaks = breaks, labels = FALSE)
  names(gene.bin) <- rownames(cm)
  gene.bin[is.na(gene.bin)] <- 0

  # read the phase-specific marker gene lists
  regev.s.genes <- read.table(file = './annotation/s_genes.txt', header = FALSE,
                              stringsAsFactors = FALSE)$V1
  regev.g2m.genes <- read.table(file = './annotation/g2m_genes.txt', header = FALSE,
                                stringsAsFactors = FALSE)$V1

  # marker genes present in the data (matched on upper-case gene names)
  goi.lst <- list('S' = rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.s.genes))],
                  'G2M' = rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.g2m.genes))])

  # use the same number of genes per phase (at most 40), highest expressed first
  n <- min(40, min(sapply(goi.lst, length)))
  goi.lst <- lapply(goi.lst, function(x) x[order(gene.mean[x], decreasing = TRUE)[1:n]])

  # N random background gene sets per phase, matched by expression bin
  bg.lst <- list('S' = get.bg.lists(goi.lst[['S']], N, gene.bin),
                 'G2M' = get.bg.lists(goi.lst[['G2M']], N, gene.bin))

  all.genes <- sort(unique(c(unlist(goi.lst, use.names = FALSE),
                             unlist(bg.lst, use.names = FALSE))))

  # log10-transformed UMI counts for all genes involved
  expr <- log10(cm[all.genes, ] + 1)

  # enrichment z-score per cell for each phase
  s.score <- enr.score(expr, goi.lst[['S']], bg.lst[['S']])
  g2m.score <- enr.score(expr, goi.lst[['G2M']], bg.lst[['G2M']])

  # discrete phase call: 1 = G2/M, -1 = S, 0 = neither or ambiguous
  phase <- as.numeric(g2m.score > 2 & s.score <= 2)
  phase[g2m.score <= 2 & s.score > 2] <- -1

  return(data.frame(score = s.score - g2m.score, s.score, g2m.score, phase))
}
```
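The listing calls two helpers, `get.bg.lists` and `enr.score`, that are not defined in this post. The sketch below shows one way they could be written to match the description in the text (bin-matched random background sets, then a z-score of the summed expression). These are hypothetical reimplementations, not the original helpers, and the toy data at the end is made up.

```r
# Hypothetical helper: build N background gene sets. Each set contains, per
# expression bin, as many randomly drawn non-marker genes as there are
# genes of interest in that bin.
get.bg.lists <- function(goi, N, gene.bin) {
  goi.by.bin <- split(goi, gene.bin[goi])
  candidates <- gene.bin[setdiff(names(gene.bin), goi)]
  lapply(seq_len(N), function(i) {
    unlist(lapply(names(goi.by.bin), function(b) {
      pool <- names(candidates)[candidates == as.numeric(b)]
      if (length(pool) == 0) stop('no background genes available in bin ', b)
      k <- length(goi.by.bin[[b]])
      # sample with replacement only if the bin has too few candidates
      pool[sample.int(length(pool), k, replace = length(pool) < k)]
    }), use.names = FALSE)
  })
}

# Hypothetical helper: per-cell z-score of the summed expression of the
# genes of interest relative to the summed expression of the background sets.
enr.score <- function(expr, goi, bg.lst) {
  goi.sum <- colSums(expr[goi, , drop = FALSE])
  # cells x N matrix of background sums
  bg.sums <- sapply(bg.lst, function(bg) colSums(expr[bg, , drop = FALSE]))
  (goi.sum - rowMeans(bg.sums)) / apply(bg.sums, 1, sd)
}

# Toy usage: 200 genes in 10 bins, 30 cells.
set.seed(1)
genes <- paste0('g', 1:200)
toy.bin <- setNames(rep(1:10, each = 20), genes)
toy.cm <- matrix(rpois(200 * 30, lambda = rep(exp(rnorm(200, 2, 1)), 30)),
                 nrow = 200, dimnames = list(genes, paste0('c', 1:30)))
toy.goi <- c('g1', 'g2', 'g21', 'g45', 'g199')
toy.bg <- get.bg.lists(toy.goi, N = 20, gene.bin = toy.bin)
toy.score <- enr.score(log10(toy.cm + 1), toy.goi, toy.bg)
```

Sampling per bin keeps each background set the same size as the marker set and matched in expression level, so the z-score reflects phase-specific signal rather than overall expression level.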