基于基因表达的打分模型
Cell cycle analysis was conducted using the cyclone function in the scran R package (39). For every cell cluster, each cell was assigned into G1, G2M and S phases.
可以学习一下,来自:PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data
根据基因表达来打分的需求有很多很多:
- 细胞周期打分
- 增殖指数 - 干细胞、癌细胞
- 迁移指数 - 某些特定细胞,如ENCC
- 代谢指数
参考文章:
Personalized Prediction of Proliferation Rates and Metabolic Liabilities in Cancer Biopsies
midbrain那篇Cell
LGE那篇Nature
Research Techniques Made Simple: Analysis of Collective Cell Migration Using the Wound Healing Assay
Cell Migration in Development and Disease
Increased cell proliferation and neurogenesis in the adult human Huntington's disease brain
做的人很多,现在想做必须加入一些新特征,cell type区分等。
cell cycle score
Removal of cell cycle effect.
The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primarily, depth) from downstream analyses. However, heterogeneity in cell cycle stage, particularly among mitotic cells transitioning between S and G2/M phases, also can drive substantial transcriptomic variation that can mask biological signal. To mitigate this effect, we use a two-step approach:
1) quantify cell cycle stage for each cell using supervised analyses with known stage-specific markers,
2) regress the effect of cell cycle stage using the same negative binomial regression as outlined above.
For the first step we use a previously published list of cell cycle dependent genes (43 S phase genes, 54 G2/M phase genes) for an enrichment analysis similar to that proposed in ref. 11.
For each cell, we compare the sum of phase-specific gene expression (log10 transformed UMIs) to the distribution of 100 random background genes sets, where the number of background genes is identical to the phase gene set, and the background genes are drawn from the same expression bins. Expression bins are defined by 50 non-overlapping windows of the same range based on log10(mean UMI). The phase-specific enrichment score is the expression z-score relative to the mean and standard deviation of the background gene sets. Our final ‘cell cycle score’ (Extended Data Fig. 1) is the difference between S-phase score and G2/M-phase score.
除了要normalize测序深度,还要考虑细胞周期的阶段(总共有四个细胞周期stage),尤其是S期和G2/M期。这里首先定量哪些已知的细胞周期的marker (43 S phase genes, 54 G2/M phase genes),其次,比较总和/100个随机的背景基因;背景基因来自相同的bin(50 non-overlapping windows of the same range),最后计算了一个z-score。
note: A z-score for a sample indicates the number of standard deviations away from the mean of expression in the reference. The formula is : z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample.
For a final normalized dataset with cell cycle effect removed, we perform negative binomial regression with technical factors and cell cycle score as predictors. Although the cell cycle activity was regressed out of the data for downstream analysis, we stored the computed cell cycle score before regression, enabling us to remember the mitotic phase of each individual cell. Notably, our regression strategy is tailored to mitigate the effect of transcriptional heterogeneity within mitotic cells in different phases, and should not affect global differences between mitotic and non-mitotic cells that may be biologically relevant.
声称自己的normalization只去除了mitotic cells内部的差异。
# input S markers, and N=100, and gene bin (all genes are divided to 33 bins) get.bg.lists <- function(goi, N, expr.bin) { res <- list() # check the marker local in which bin, and table the count goi.bin.tab <- table(expr.bin[goi]) for (i in 1:N) { # each time, random select 38 background genes and merge res[[i]] <- unlist(lapply(names(goi.bin.tab), function(b) { # find the genes in this bin & genes not in the markers sel <- which(expr.bin == as.numeric(b) & !(names(expr.bin) %in% goi)) # random sample, count corespond to the number of marker sample(names(expr.bin)[sel], goi.bin.tab[b]) })) } return(res) } # input log2 expr, S markers, 100 background genes enr.score <- function(expr, goi, bg.lst) { # S markers mean of each cell goi.mean <- apply(expr[goi, ], 2, mean) # get background genes mean by 100 samples bg.mean <- sapply(1:length(bg.lst), function(i) apply(expr[bg.lst[[i]], ], 2, mean)) # return z-score return((goi.mean - apply(bg.mean, 1, mean)) / apply(bg.mean, 1, sd)) } get.cc.score <- function(cm, N=100, seed=42) { set.seed(seed) cat('get.cc.score, ') cat('number of random background gene sets set to', N, '\n') min.cells <- 5 # min detected gene number cells.mols <- apply(cm, 2, sum) # total umi gene.cells <- apply(cm>0, 1, sum) # total detected gene number cm <- cm[gene.cells >= min.cells, ] # filter gene.mean <- apply(cm, 1, mean) # gene mean # divide all genes to 50 bins breaks <- unique(quantile(log10(gene.mean), probs = seq(0,1, length.out = 50))) gene.bin <- cut(log10(gene.mean), breaks = breaks, labels = FALSE) # table(gene.bin) to see count names(gene.bin) <- rownames(cm) gene.bin[is.na(gene.bin)] <- 0 regev.s.genes <- read.table(file='./annotation/s_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1 regev.g2m.genes <- read.table(file='./annotation/g2m_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1 # list for go, mouse to human goi.lst <- list('S'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.s.genes))], 'G2M'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.g2m.genes))]) n <- min(40, min(sapply(goi.lst, length))) # 38 # sort the marker by their expression goi.lst <- lapply(goi.lst, function(x) x[order(gene.mean[x], decreasing = TRUE)[1:n]]) # double list, background gene list bg.lst <- list('S'=get.bg.lists(goi.lst[['S']], N, gene.bin), 'G2M'=get.bg.lists(goi.lst[['G2M']], N, gene.bin)) # merge all genes, markers + background, sort by name all.genes <- sort(unique(c(unlist(goi.lst, use.names=FALSE), unlist(bg.lst, use.names=FALSE)))) expr <- log10(cm[all.genes, ]+1) # s.score <- enr.score(expr, goi.lst[['S']], bg.lst[['S']]) g2m.score <- enr.score(expr, goi.lst[['G2M']], bg.lst[['G2M']]) phase <- as.numeric(g2m.score > 2 & s.score <= 2) phase[g2m.score <= 2 & s.score > 2] <- -1 return(data.frame(score=s.score-g2m.score, s.score, g2m.score, phase)) }
问题:
1. 这确实是一种打分模型,但是如何评估模型是否准确呢?
想构建一个打分模型,用于scRNA-seq的细胞类型打分,以及bulk RNA-seq的细胞组成成分预测。
其实有篇文章已经做过这种工作了,做了第一部分,第二部分还没有评估。