SC3聚类 | 拉普拉斯矩阵 | Laplacian matrix | 图论 | R代码
Laplacian和PCA貌似是同一种性质的方法,坐标系变换。只是拉普拉斯属于图论的范畴,术语更加专业了。
要看就把一篇文章看完整,再看其中有什么值得借鉴的,总结归纳理解后的东西才是属于你的。
问题:
1. 这篇文章有哪些亮点决定他能发NM?单细胞,consensus,较好的表现,包装了一些专业的术语,显得自己很专业,其实真正做的东西很少;
2. consensus方法的本质是什么?
3. 工具的评估准则?ARI,silhouette index
4. SC3的最大缺点是什么?速度太慢,超过1000个细胞就非常耗费计算和存储资源
5. 能看懂SC3这个R包的逻辑吗?核心的就4步,多种距离度量,转换,kmeans聚类,consensus;
The main sc3 method explained above is a wrapper that calls several other SC3 methods in the following order:
- sc3_prepare
- sc3_estimate_k - Tracy-Widom theory - random matrix theory (RMT)
- sc3_calc_dists
- sc3_calc_transfs
- sc3_kmeans
- sc3_calc_consens
- sc3_calc_biology
6. 有很多问题没有回答,这篇文章偏技工!核心就是kmeans,打了个复杂的包而已。
- 不同距离的度量有什么差异?
- 为什么要做两种转换PCA和laplacian?
- 为什么选择了kmeans?不知道它有天然的劣势吗
- 做consensus的理论依据是什么?凭什么说做了一致性后结果就更好?
最近在看SC3聚类这篇文章,SC3使用了这个工具。
SC3: consensus clustering of single-cell RNA-seq data
All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian (L = I – D–1/2AD–1/2, where I is the identity matrix, A is a similarity matrix (A = e–A′/max(A′)), where A′ is a distance matrix) and D is the degree matrix of A, a diagonal matrix that contains the row-sums of A on the diagonal (Dii = ΣjAij). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.
先看下该工具的功能:SC3 package manual
跑一下常规代码:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | library (SingleCellExperiment) library (SC3) library (scater) head (ann) yan[1:3, 1:3] # create a SingleCellExperiment object sce <- SingleCellExperiment ( assays = list ( counts = as.matrix (yan), logcounts = log2 ( as.matrix (yan) + 1) ), colData = ann ) # define feature names in feature_symbol column rowData (sce)$feature_symbol <- rownames (sce) # remove features with duplicated names sce <- sce[! duplicated ( rowData (sce)$feature_symbol), ] # define spike-ins isSpike (sce, "ERCC" ) <- grepl ( "ERCC" , rowData (sce)$feature_symbol) plotPCA (sce, colour_by = "cell_type1" ) sce <- sc3 (sce, ks = 2:4, biology = TRUE ) # sc3_interactive(sce) # sc3_export_results_xls(sce) ###################################### sce <- sc3_prepare (sce) sce <- sc3_estimate_k (sce) sce <- sc3_calc_dists (sce) names ( metadata (sce)$sc3$distances) sce <- sc3_calc_transfs (sce) names ( metadata (sce)$sc3$transformations) metadata (sce)$sc3$distances sce <- sc3_kmeans (sce, ks = 2:4) names ( metadata (sce)$sc3$kmeans) col_data <- colData (sce) head (col_data[ , grep ( "sc3_" , colnames (col_data))]) sce <- sc3_calc_consens (sce) names ( metadata (sce)$sc3$consensus) names ( metadata (sce)$sc3$consensus$`3`) col_data <- colData (sce) head (col_data[ , grep ( "sc3_" , colnames (col_data))]) sce <- sc3_calc_biology (sce, ks = 2:4) sce <- sc3_run_svm (sce, ks = 2:4) col_data <- colData (sce) head (col_data[ , grep ( "sc3_" , colnames (col_data))]) |
接下来会尝试拆一下该工具。
怎么拆这个工具?
这种封装的很好的R包其实比较难拆,一般的通过函数名字就可以看到R代码,但这里你输入函数名,如sc3_calc_dists,看到的只是以下的封装好的代码:
1 2 3 4 5 6 7 8 | new ( "nonstandardGenericFunction" , .Data = function (object) { standardGeneric ( "sc3_calc_dists" ) }, generic = structure ( "sc3_calc_dists" , package = "SC3" ), package = "SC3" , group = list (), valueClass = character (0), signature = "object" , default = NULL , skeleton = ( function (object) stop ( "invalid call in method dispatch to 'sc3_calc_dists' (no default method)" , domain = NA ))(object)) |
暂时还不熟悉这种形式,所以只能通过函数名去GitHub里面查了。
GitHub真的很优秀,可以直接查文件内部代码,可以很快定位到sc3_calc_dists。
再配合这个目录插件,效率提高了不少,https://www.octotree.io/?utm_source=lite&utm_medium=extension
以下是封装前的代码:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | #' Calculate distances between the cells. #' #' This function calculates distances between the cells. It #' creates and populates the following items of the \code{sc3} slot of the \code{metadata(object)}: #' \itemize{ #' \item \code{distances} - contains a list of distance matrices corresponding to #' Euclidean, Pearson and Spearman distances. #' } #' #' @name sc3_calc_dists #' @aliases sc3_calc_dists, sc3_calc_dists,SingleCellExperiment-method #' #' @param object an object of \code{SingleCellExperiment} class #' #' @return an object of \code{SingleCellExperiment} class #' #' @importFrom doRNG %dorng% #' @importFrom foreach foreach %dopar% #' @importFrom parallel makeCluster stopCluster #' @importFrom doParallel registerDoParallel sc3_calc_dists.SingleCellExperiment <- function (object) { dataset <- get_processed_dataset (object) # check whether in the SVM regime if (! is.null ( metadata (object)$sc3$svm_train_inds)) { dataset <- dataset[, metadata (object)$sc3$svm_train_inds] } # NULLing the variables to avoid notes in R CMD CHECK i <- NULL distances <- c ( "euclidean" , "pearson" , "spearman" ) message ( "Calculating distances between the cells..." ) if ( metadata (object)$sc3$n_cores > length (distances)) { n_cores <- length (distances) } else { n_cores <- metadata (object)$sc3$n_cores } cl <- parallel:: makeCluster (n_cores, outfile = "" ) doParallel:: registerDoParallel (cl, cores = n_cores) # calculate distances in parallel dists <- foreach:: foreach (i = distances) %dorng% { try ({ calculate_distance (dataset, i) }) } # stop local cluster parallel:: stopCluster (cl) names (dists) <- distances metadata (object)$sc3$distances <- dists return (object) } #' @rdname sc3_calc_dists #' @aliases sc3_calc_dists setMethod ( "sc3_calc_dists" , signature (object = "SingleCellExperiment" ), sc3_calc_dists.SingleCellExperiment) |
通过setMethod链接到一起的。
顺路找到了原函数:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | #' Calculate a distance matrix #' #' Distance between the cells, i.e. columns, in the input expression matrix are #' calculated using the Euclidean, Pearson and Spearman metrics to construct #' distance matrices. #' #' @param data expression matrix #' @param method one of the distance metrics: 'spearman', 'pearson', 'euclidean' #' @return distance matrix #' #' @importFrom stats cor dist #' #' @useDynLib SC3 #' @importFrom Rcpp sourceCpp #' calculate_distance <- function (data, method) { return ( if (method == "spearman" ) { as.matrix (1 - cor (data, method = "spearman" )) } else if (method == "pearson" ) { as.matrix (1 - cor (data, method = "pearson" )) } else { ED2 (data) }) } #' Distance matrix transformation #' #' All distance matrices are transformed using either principal component #' analysis (PCA) or by calculating the #' eigenvectors of the graph Laplacian (Spectral). #' The columns of the resulting matrices are then sorted in #' descending order by their corresponding eigenvalues. #' #' @param dists distance matrix #' @param method transformation method: either 'pca' or #' 'laplacian' #' @return transformed distance matrix #' #' @importFrom stats prcomp cmdscale #' transformation <- function (dists, method) { if (method == "pca" ) { t <- prcomp (dists, center = TRUE , scale. = TRUE ) return (t$rotation) } else if (method == "laplacian" ) { L <- norm_laplacian (dists) l <- eigen (L) # sort eigenvectors by their eigenvalues return (l$vectors[, order (l$values)]) } } #' Calculate consensus matrix #' #' Consensus matrix is calculated using the Cluster-based Similarity #' Partitioning Algorithm (CSPA). For each clustering solution a binary #' similarity matrix is constructed from the corresponding cell labels: #' if two cells belong to the same cluster, their similarity is 1, otherwise #' the similarity is 0. A consensus matrix is calculated by averaging all #' similarity matrices. #' #' @param clusts a matrix containing clustering solutions in columns #' @return consensus matrix #' #' @useDynLib SC3 #' @importFrom Rcpp sourceCpp #' @export consensus_matrix <- function (clusts) { res <- consmx (clusts) colnames (res) <- as.character ( c (1: nrow (clusts))) rownames (res) <- as.character ( c (1: nrow (clusts))) return (res) } |
- 距离计算
- 转换
- consensus
都在这里。。。
ED2是他们实验室自己用Rcpp写的一个计算欧氏距离的工具。
transformation输入的是对称的距离矩阵(行列都是样本细胞),然后做完PCA,返回了rotation,不知道这样做有什么意义?
还真有用PCA来处理距离相似度矩阵的,MDS,目的就是降维,因为后面要用kmean聚类;
然后对每一个降维了的矩阵用kmeans;
consensus用的是这个算法:Cluster-based Similarity Partitioning Algorithm (CSPA),做这个的意义何在?输入是每个细胞的多重聚类结果,然后做了一个一致性统一。
参考:
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 探究高空视频全景AR技术的实现原理
· 理解Rust引用及其生命周期标识(上)
· 浏览器原生「磁吸」效果!Anchor Positioning 锚点定位神器解析
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· 记一次.NET内存居高不下排查解决与启示
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· .NET10 - 预览版1新功能体验(一)
2018-05-17 如何理解我们的记忆?