SC3聚类 | 拉普拉斯矩阵 | Laplacian matrix | 图论 | R代码

Laplacian和PCA貌似是同一种性质的方法,坐标系变换。只是拉普拉斯属于图论的范畴,术语更加专业了。

 

要看就把一篇文章看完整,再看其中有什么值得借鉴的,总结归纳理解后的东西才是属于你的。

问题:

1. 这篇文章有哪些亮点决定他能发NM?单细胞,consensus,较好的表现,包装了一些专业的术语,显得自己很专业,其实真正做的东西很少;

2. consensus方法的本质是什么?

3. 工具的评估准则?ARI,silhouette index

4. SC3的最大缺点是什么?速度太慢,超过1000个细胞就非常耗费计算和存储资源

5. 能看懂SC3这个R包的逻辑吗?核心的就4步,多种距离度量,转换,kmeans聚类,consensus;

The main sc3 method explained above is a wrapper that calls several other SC3 methods in the following order:

  • sc3_prepare
  • sc3_estimate_k - Tracy-Widom theory - random matrix theory (RMT)
  • sc3_calc_dists
  • sc3_calc_transfs
  • sc3_kmeans
  • sc3_calc_consens
  • sc3_calc_biology

6. 有很多问题没有回答,这篇文章偏技工!核心就是kmeans,打了个复杂的包而已。

  • 不同距离的度量有什么差异?
  • 为什么要做两种转换PCA和laplacian?
  • 为什么选择了kmeans?不知道它有天然的劣势吗
  • 做consensus的理论依据是什么?凭什么说做了一致性后结果就更好?

 

 


 

最近在看SC3聚类这篇文章,SC3使用了这个工具。

SC3: consensus clustering of single-cell RNA-seq data

All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian (L = I – D–1/2AD–1/2, where I is the identity matrix, A is a similarity matrix (A = e–A′/max(A′)), where A′ is a distance matrix) and D is the degree matrix of A, a diagonal matrix that contains the row-sums of A on the diagonal (Dii = ΣjAij). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.

先看下该工具的功能:SC3 package manual

跑一下常规代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
library(SingleCellExperiment)
library(SC3)
library(scater)
 
head(ann)
yan[1:3, 1:3]
 
# create a SingleCellExperiment object
sce <- SingleCellExperiment(
  assays = list(
    counts = as.matrix(yan),
    logcounts = log2(as.matrix(yan) + 1)
  ),
  colData = ann
)
 
# define feature names in feature_symbol column
rowData(sce)$feature_symbol <- rownames(sce)
# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]
 
# define spike-ins
isSpike(sce, "ERCC") <- grepl("ERCC", rowData(sce)$feature_symbol)
 
plotPCA(sce, colour_by = "cell_type1")
 
sce <- sc3(sce, ks = 2:4, biology = TRUE)
# sc3_interactive(sce)
# sc3_export_results_xls(sce)
 
######################################
sce <- sc3_prepare(sce)
 
sce <- sc3_estimate_k(sce)
 
sce <- sc3_calc_dists(sce)
names(metadata(sce)$sc3$distances)
 
sce <- sc3_calc_transfs(sce)
names(metadata(sce)$sc3$transformations)
metadata(sce)$sc3$distances
 
sce <- sc3_kmeans(sce, ks = 2:4)
names(metadata(sce)$sc3$kmeans)
 
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
sce <- sc3_calc_consens(sce)
names(metadata(sce)$sc3$consensus)
names(metadata(sce)$sc3$consensus$`3`)
 
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
 
sce <- sc3_calc_biology(sce, ks = 2:4)
 
sce <- sc3_run_svm(sce, ks = 2:4)
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])

  

接下来会尝试拆一下该工具。

怎么拆这个工具?

这种封装的很好的R包其实比较难拆,一般的通过函数名字就可以看到R代码,但这里你输入函数名,如sc3_calc_dists,看到的只是以下的封装好的代码:

1
2
3
4
5
6
7
8
new("nonstandardGenericFunction", .Data = function (object)
{
    standardGeneric("sc3_calc_dists")
}, generic = structure("sc3_calc_dists", package = "SC3"), package = "SC3",
    group = list(), valueClass = character(0), signature = "object",
    default = NULL, skeleton = (function (object)
    stop("invalid call in method dispatch to 'sc3_calc_dists' (no default method)",
        domain = NA))(object))  

暂时还不熟悉这种形式,所以只能通过函数名去GitHub里面查了。

GitHub真的很优秀,可以直接查文件内部代码,可以很快定位到sc3_calc_dists。

再配合这个目录插件,效率提高了不少,https://www.octotree.io/?utm_source=lite&utm_medium=extension

以下是封装前的代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#' Calculate distances between the cells.
#'
#' This function calculates distances between the cells. It
#' creates and populates the following items of the \code{sc3} slot of the \code{metadata(object)}:
#' \itemize{
#'   \item \code{distances} - contains a list of distance matrices corresponding to
#'   Euclidean, Pearson and Spearman distances.
#' }
#'
#' @name sc3_calc_dists
#' @aliases sc3_calc_dists, sc3_calc_dists,SingleCellExperiment-method
#'
#' @param object an object of \code{SingleCellExperiment} class
#'
#' @return an object of \code{SingleCellExperiment} class
#'
#' @importFrom doRNG %dorng%
#' @importFrom foreach foreach %dopar%
#' @importFrom parallel makeCluster stopCluster
#' @importFrom doParallel registerDoParallel
sc3_calc_dists.SingleCellExperiment <- function(object) {
    dataset <- get_processed_dataset(object)
     
    # check whether in the SVM regime
    if (!is.null(metadata(object)$sc3$svm_train_inds)) {
        dataset <- dataset[, metadata(object)$sc3$svm_train_inds]
    }
     
    # NULLing the variables to avoid notes in R CMD CHECK
    i <- NULL
     
    distances <- c("euclidean", "pearson", "spearman")
     
    message("Calculating distances between the cells...")
     
    if (metadata(object)$sc3$n_cores > length(distances)) {
        n_cores <- length(distances)
    } else {
        n_cores <- metadata(object)$sc3$n_cores
    }
     
    cl <- parallel::makeCluster(n_cores, outfile = "")
    doParallel::registerDoParallel(cl, cores = n_cores)
     
    # calculate distances in parallel
    dists <- foreach::foreach(i = distances) %dorng% {
        try({
            calculate_distance(dataset, i)
        })
    }
     
    # stop local cluster
    parallel::stopCluster(cl)
     
    names(dists) <- distances
     
    metadata(object)$sc3$distances <- dists
    return(object)
}
 
#' @rdname sc3_calc_dists
#' @aliases sc3_calc_dists
setMethod("sc3_calc_dists", signature(object = "SingleCellExperiment"), sc3_calc_dists.SingleCellExperiment)  

通过setMethod链接到一起的。

顺路找到了原函数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#' Calculate a distance matrix
#'
#' Distance between the cells, i.e. columns, in the input expression matrix are
#' calculated using the Euclidean, Pearson and Spearman metrics to construct
#' distance matrices.
#'
#' @param data expression matrix
#' @param method one of the distance metrics: 'spearman', 'pearson', 'euclidean'
#' @return distance matrix
#'
#' @importFrom stats cor dist
#'
#' @useDynLib SC3
#' @importFrom Rcpp sourceCpp
#'
calculate_distance <- function(data, method) {
    return(if (method == "spearman") {
        as.matrix(1 - cor(data, method = "spearman"))
    } else if (method == "pearson") {
        as.matrix(1 - cor(data, method = "pearson"))
    } else {
        ED2(data)
    })
}
 
#' Distance matrix transformation
#'
#' All distance matrices are transformed using either principal component
#' analysis (PCA) or by calculating the
#' eigenvectors of the graph Laplacian (Spectral).
#' The columns of the resulting matrices are then sorted in
#' descending order by their corresponding eigenvalues.
#'
#' @param dists distance matrix
#' @param method transformation method: either 'pca' or
#' 'laplacian'
#' @return transformed distance matrix
#'
#' @importFrom stats prcomp cmdscale
#'
transformation <- function(dists, method) {
    if (method == "pca") {
        t <- prcomp(dists, center = TRUE, scale. = TRUE)
        return(t$rotation)
    } else if (method == "laplacian") {
        L <- norm_laplacian(dists)
        l <- eigen(L)
        # sort eigenvectors by their eigenvalues
        return(l$vectors[, order(l$values)])
    }
}
 
#' Calculate consensus matrix
#'
#' Consensus matrix is calculated using the Cluster-based Similarity
#' Partitioning Algorithm (CSPA). For each clustering solution a binary
#' similarity matrix is constructed from the corresponding cell labels:
#' if two cells belong to the same cluster, their similarity is 1, otherwise
#' the similarity is 0. A consensus matrix is calculated by averaging all
#' similarity matrices.
#'
#' @param clusts a matrix containing clustering solutions in columns
#' @return consensus matrix
#'
#' @useDynLib SC3
#' @importFrom Rcpp sourceCpp
#' @export
consensus_matrix <- function(clusts) {
    res <- consmx(clusts)
    colnames(res) <- as.character(c(1:nrow(clusts)))
    rownames(res) <- as.character(c(1:nrow(clusts)))
    return(res)
}

 

  • 距离计算
  • 转换
  • consensus

都在这里。。。  

 

ED2是他们实验室自己用Rcpp写的一个计算欧氏距离的工具。

transformation输入的是对称的距离矩阵(行列都是样本细胞),然后做完PCA,返回了rotation,不知道这样做有什么意义?

还真有用PCA来处理距离相似度矩阵的,MDS,目的就是降维,因为后面要用kmean聚类;

然后对每一个降维了的矩阵用kmeans;

consensus用的是这个算法:Cluster-based Similarity Partitioning Algorithm (CSPA),做这个的意义何在?输入是每个细胞的多重聚类结果,然后做了一个一致性统一。

 

 

参考:

拉普拉斯矩阵(Laplacian matrix)

 

posted @   Life·Intelligence  阅读(1506)  评论(0编辑  收藏  举报
(评论功能已被禁用)
编辑推荐:
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)
历史上的今天:
2018-05-17 如何理解我们的记忆?
TOP
点击右上角即可分享
微信分享提示