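SC3 clustering | Laplacian matrix | graph theory | R code
The Laplacian transformation and PCA look like methods of the same kind: both are coordinate-system transformations. The Laplacian simply belongs to graph theory, so its terminology is more specialized.
If you are going to read a paper, read it in full, then look for what is worth borrowing; only what you have summarized, organized and understood is truly yours.
Questions:
1. What highlights got this paper into NM (Nature Methods)? Single-cell data, consensus, fairly good performance, and a layer of specialist terminology that makes it look expert, while the amount of actual work is small;
2. What is the essence of the consensus method?
3. What are the evaluation criteria for the tool? ARI, silhouette index
4. What is SC3's biggest drawback? It is too slow: beyond 1000 cells it becomes very expensive in compute and storage
5. Can you follow the logic of the SC3 R package? The core is just 4 steps: several distance metrics, transformation, kmeans clustering, consensus;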
The main sc3 method explained above is a wrapper that calls several other SC3 methods in the following order:
- sc3_prepare
- sc3_estimate_k - Tracy-Widom theory - random matrix theory (RMT)
- sc3_calc_dists
- sc3_calc_transfs
- sc3_kmeans
- sc3_calc_consens
- sc3_calc_biology
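6. Many questions are left unanswered; this paper leans toward engineering! The core is just kmeans, wrapped in a complicated package.
- How do the different distance metrics differ?
- Why apply two transformations, PCA and Laplacian?
- Why choose kmeans? Were its inherent weaknesses not known?
- What is the theoretical basis for consensus? Why should the result be better after taking a consensus?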
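Question 3 names ARI as an evaluation criterion. As a rough illustration, here is a minimal base R sketch of the adjusted Rand index between two label vectors. The function name `adjusted_rand_index` and the toy labels are made up for this example and are not taken from SC3; the degenerate case where both partitions put every cell in one cluster (zero denominator) is not handled.

```r
# Adjusted Rand index between two label vectors (base R only).
adjusted_rand_index <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                       # contingency table of the two partitions
  choose2 <- function(n) n * (n - 1) / 2   # number of pairs among n items
  sum_cells <- sum(choose2(tab))           # pairs grouped together in both partitions
  sum_rows  <- sum(choose2(rowSums(tab)))  # pairs grouped together in x
  sum_cols  <- sum(choose2(colSums(tab)))  # pairs grouped together in y
  total     <- choose2(length(x))
  expected  <- sum_rows * sum_cols / total # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

truth   <- c(1, 1, 1, 2, 2, 2)
relabel <- c("b", "b", "b", "a", "a", "a")  # same partition, different label names
shifted <- c(1, 1, 2, 2, 2, 2)              # one cell moved to the other cluster

ari_same    <- adjusted_rand_index(truth, relabel)
ari_shifted <- adjusted_rand_index(truth, shifted)
```

Because the index only looks at which pairs of cells are grouped together, renaming the clusters leaves it unchanged, while moving a single cell lowers it.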
I have recently been reading the SC3 clustering paper, and SC3 uses this tool (the graph Laplacian).
SC3: consensus clustering of single-cell RNA-seq data
All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian ($L = I - D^{-1/2} A D^{-1/2}$, where $I$ is the identity matrix, $A$ is a similarity matrix ($A = e^{-A'/\max(A')}$, where $A'$ is a distance matrix) and $D$ is the degree matrix of $A$, a diagonal matrix that contains the row-sums of $A$ on the diagonal ($D_{ii} = \sum_j A_{ij}$)). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.
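To make the quoted formula concrete, here is a small base R sketch that builds the normalized graph Laplacian from a distance matrix and sorts the eigenvectors by ascending eigenvalue. The function name `normalized_laplacian` and the random toy data are assumptions for illustration; this follows the formula as quoted, not necessarily the package's internal implementation.

```r
# Normalized graph Laplacian from a distance matrix, following the quoted formula:
#   A = exp(-A' / max(A')),  D_ii = sum_j A_ij,  L = I - D^{-1/2} A D^{-1/2}
normalized_laplacian <- function(dist_mat) {
  A <- exp(-dist_mat / max(dist_mat))
  d_inv_sqrt <- 1 / sqrt(rowSums(A))
  # element (i, j) of D^{-1/2} A D^{-1/2} is A_ij / sqrt(D_ii * D_jj)
  diag(nrow(A)) - A * outer(d_inv_sqrt, d_inv_sqrt)
}

set.seed(1)
pts <- matrix(rnorm(20), nrow = 10)     # 10 toy "cells" in 2 dimensions
dist_mat <- as.matrix(dist(pts))        # symmetric 10 x 10 distance matrix

L <- normalized_laplacian(dist_mat)
eig <- eigen(L, symmetric = TRUE)

# eigen() returns eigenvalues in decreasing order, so reorder to ascending
values_sorted  <- sort(eig$values)
vectors_sorted <- eig$vectors[, order(eig$values)]
```

With this construction the smallest eigenvalue is zero up to rounding, and all eigenvalues lie between 0 and 2, which is a quick sanity check on the result.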
First, look at what the tool can do: SC3 package manual
Run the standard code:
library(SingleCellExperiment)
library(SC3)
library(scater)

head(ann)
yan[1:3, 1:3]

# create a SingleCellExperiment object
sce <- SingleCellExperiment(
    assays = list(
        counts = as.matrix(yan),
        logcounts = log2(as.matrix(yan) + 1)
    ),
    colData = ann
)
# define feature names in feature_symbol column
rowData(sce)$feature_symbol <- rownames(sce)
# remove features with duplicated names
sce <- sce[!duplicated(rowData(sce)$feature_symbol), ]
# define spike-ins (isSpike() is deprecated in newer SingleCellExperiment versions)
isSpike(sce, "ERCC") <- grepl("ERCC", rowData(sce)$feature_symbol)

plotPCA(sce, colour_by = "cell_type1")

# run the whole pipeline in one call
sce <- sc3(sce, ks = 2:4, biology = TRUE)
# sc3_interactive(sce)
# sc3_export_results_xls(sce)

######################################
# the same pipeline, one step at a time
sce <- sc3_prepare(sce)
sce <- sc3_estimate_k(sce)
sce <- sc3_calc_dists(sce)
names(metadata(sce)$sc3$distances)
sce <- sc3_calc_transfs(sce)
names(metadata(sce)$sc3$transformations)
metadata(sce)$sc3$distances
sce <- sc3_kmeans(sce, ks = 2:4)
names(metadata(sce)$sc3$kmeans)
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
sce <- sc3_calc_consens(sce)
names(metadata(sce)$sc3$consensus)
names(metadata(sce)$sc3$consensus$`3`)
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
sce <- sc3_calc_biology(sce, ks = 2:4)
# SVM step; only meaningful when a training subset of cells was selected
sce <- sc3_run_svm(sce, ks = 2:4)
col_data <- colData(sce)
head(col_data[ , grep("sc3_", colnames(col_data))])
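Next I will try to take the tool apart.
How do you take this tool apart?
A well-encapsulated R package like this is actually fairly hard to take apart. Normally, typing a function name shows its R code, but here, if you type a function name such as sc3_calc_dists, all you see is the following wrapper: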
new("nonstandardGenericFunction", .Data = function (object)
{
standardGeneric("sc3_calc_dists")
}, generic = structure("sc3_calc_dists", package = "SC3"), package = "SC3",
group = list(), valueClass = character(0), signature = "object",
default = NULL, skeleton = (function (object)
stop("invalid call in method dispatch to 'sc3_calc_dists' (no default method)",
domain = NA))(object))
I am not yet familiar with this form (it is an S4 generic), so for now I could only look the function up by name on GitHub.
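The output above is what an S4 generic looks like when printed: only the dispatcher is shown, and the real code lives in a method registered for a class. A minimal self-contained sketch of that mechanism follows; the class `ToyCells` and the generic `toy_calc_dists` are hypothetical names invented for this example, not part of SC3.

```r
library(methods)

# A toy S4 class and generic (hypothetical names, not from SC3)
setClass("ToyCells", representation(counts = "matrix"))
setGeneric("toy_calc_dists", function(object) standardGeneric("toy_calc_dists"))

# The real work lives in an ordinary function ...
toy_calc_dists.ToyCells <- function(object) {
  as.matrix(dist(t(object@counts)))   # distances between columns
}
# ... which is registered as the method for the ToyCells class
setMethod("toy_calc_dists", signature(object = "ToyCells"), toy_calc_dists.ToyCells)

# Printing `toy_calc_dists` would show only the standardGeneric() dispatcher.
# getMethod() retrieves the implementation registered for a given class:
impl <- getMethod("toy_calc_dists", "ToyCells")
impl_body <- deparse(body(impl))

obj <- new("ToyCells", counts = matrix(1:6, nrow = 2))
d <- toy_calc_dists(obj)              # dispatches to toy_calc_dists.ToyCells
```

By the same logic, with SC3 loaded, `getMethod("sc3_calc_dists", "SingleCellExperiment")` or `showMethods("sc3_calc_dists")` should reveal the implementation without leaving R, and non-exported helpers can usually be inspected with the `SC3:::` prefix.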
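GitHub is really good for this: you can search inside files directly and quickly locate sc3_calc_dists.
Paired with this directory-tree plugin, it is a lot more efficient: https://www.octotree.io/?utm_source=lite&utm_medium=extension
Below is the underlying code behind the wrapper: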
#' Calculate distances between the cells.
#'
#' This function calculates distances between the cells. It
#' creates and populates the following items of the \code{sc3} slot of the \code{metadata(object)}:
#' \itemize{
#' \item \code{distances} - contains a list of distance matrices corresponding to
#' Euclidean, Pearson and Spearman distances.
#' }
#'
#' @name sc3_calc_dists
#' @aliases sc3_calc_dists, sc3_calc_dists,SingleCellExperiment-method
#'
#' @param object an object of \code{SingleCellExperiment} class
#'
#' @return an object of \code{SingleCellExperiment} class
#'
#' @importFrom doRNG %dorng%
#' @importFrom foreach foreach %dopar%
#' @importFrom parallel makeCluster stopCluster
#' @importFrom doParallel registerDoParallel
sc3_calc_dists.SingleCellExperiment <- function(object) {
    dataset <- get_processed_dataset(object)

    # check whether in the SVM regime
    if (!is.null(metadata(object)$sc3$svm_train_inds)) {
        dataset <- dataset[, metadata(object)$sc3$svm_train_inds]
    }

    # NULLing the variables to avoid notes in R CMD CHECK
    i <- NULL

    distances <- c("euclidean", "pearson", "spearman")
    message("Calculating distances between the cells...")

    if (metadata(object)$sc3$n_cores > length(distances)) {
        n_cores <- length(distances)
    } else {
        n_cores <- metadata(object)$sc3$n_cores
    }

    cl <- parallel::makeCluster(n_cores, outfile = "")
    doParallel::registerDoParallel(cl, cores = n_cores)

    # calculate distances in parallel
    dists <- foreach::foreach(i = distances) %dorng% {
        try({
            calculate_distance(dataset, i)
        })
    }

    # stop local cluster
    parallel::stopCluster(cl)

    names(dists) <- distances
    metadata(object)$sc3$distances <- dists
    return(object)
}

#' @rdname sc3_calc_dists
#' @aliases sc3_calc_dists
setMethod("sc3_calc_dists", signature(object = "SingleCellExperiment"), sc3_calc_dists.SingleCellExperiment)
The generic and this implementation are linked together through setMethod.
Along the way I also found the underlying functions:
#' Calculate a distance matrix
#'
#' Distance between the cells, i.e. columns, in the input expression matrix are
#' calculated using the Euclidean, Pearson and Spearman metrics to construct
#' distance matrices.
#'
#' @param data expression matrix
#' @param method one of the distance metrics: 'spearman', 'pearson', 'euclidean'
#' @return distance matrix
#'
#' @importFrom stats cor dist
#'
#' @useDynLib SC3
#' @importFrom Rcpp sourceCpp
#'
calculate_distance <- function(data, method) {
    return(if (method == "spearman") {
        as.matrix(1 - cor(data, method = "spearman"))
    } else if (method == "pearson") {
        as.matrix(1 - cor(data, method = "pearson"))
    } else {
        ED2(data)
    })
}

#' Distance matrix transformation
#'
#' All distance matrices are transformed using either principal component
#' analysis (PCA) or by calculating the
#' eigenvectors of the graph Laplacian (Spectral).
#' The columns of the resulting matrices are then sorted in
#' descending order by their corresponding eigenvalues.
#'
#' @param dists distance matrix
#' @param method transformation method: either 'pca' or
#' 'laplacian'
#' @return transformed distance matrix
#'
#' @importFrom stats prcomp cmdscale
#'
transformation <- function(dists, method) {
    if (method == "pca") {
        t <- prcomp(dists, center = TRUE, scale. = TRUE)
        return(t$rotation)
    } else if (method == "laplacian") {
        L <- norm_laplacian(dists)
        l <- eigen(L)
        # sort eigenvectors by their eigenvalues
        return(l$vectors[, order(l$values)])
    }
}

#' Calculate consensus matrix
#'
#' Consensus matrix is calculated using the Cluster-based Similarity
#' Partitioning Algorithm (CSPA). For each clustering solution a binary
#' similarity matrix is constructed from the corresponding cell labels:
#' if two cells belong to the same cluster, their similarity is 1, otherwise
#' the similarity is 0. A consensus matrix is calculated by averaging all
#' similarity matrices.
#'
#' @param clusts a matrix containing clustering solutions in columns
#' @return consensus matrix
#'
#' @useDynLib SC3
#' @importFrom Rcpp sourceCpp
#' @export
consensus_matrix <- function(clusts) {
    res <- consmx(clusts)
    colnames(res) <- as.character(c(1:nrow(clusts)))
    rownames(res) <- as.character(c(1:nrow(clusts)))
    return(res)
}
- distance calculation
- transformation
- consensus
They are all here...
ED2 is a tool that their lab wrote themselves in Rcpp to compute Euclidean distances.
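For experimenting without the compiled code, the three metrics can be approximated in base R. This is only a sketch: the function name `cell_distances` and the toy count matrix are invented here, and `dist()` stands in for ED2 on the assumption that ED2 computes plain Euclidean distances between columns.

```r
# Cell-to-cell distances for a genes x cells matrix, base R only.
# Mirrors the three metrics in calculate_distance; dist() stands in for ED2.
cell_distances <- function(data, method = c("euclidean", "pearson", "spearman")) {
  method <- match.arg(method)
  if (method == "euclidean") {
    as.matrix(dist(t(data)))                   # dist() compares rows, so transpose
  } else {
    as.matrix(1 - cor(data, method = method))  # cor() compares columns
  }
}

set.seed(42)
expr <- matrix(rpois(60, lambda = 5), nrow = 12,
               dimnames = list(paste0("gene", 1:12), paste0("cell", 1:5)))

d_euc <- cell_distances(expr, "euclidean")
d_pea <- cell_distances(expr, "pearson")
d_spe <- cell_distances(expr, "spearman")
```

All three results are symmetric cells x cells matrices with zeros on the diagonal. The correlation-based ones are bounded between 0 and 2, while the Euclidean one grows with expression magnitude.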
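transformation takes a symmetric distance matrix as input (rows and columns are both cells), runs PCA on it, and returns rotation. At first I could not see the point of doing this.
It turns out there really is a family of methods that applies PCA-like decompositions to distance or similarity matrices: MDS. The goal is dimensionality reduction, because kmeans clustering comes next;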
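To see what the PCA branch actually returns, the following sketch runs `prcomp` on a toy distance matrix exactly as in the `transformation` function and, for comparison, classical MDS via `cmdscale`. The toy data and variable names are assumptions for illustration. The two computations are related but not identical, so this only checks that both give a low-dimensional coordinate per cell in which the groups separate.

```r
set.seed(7)
# Two well-separated groups of 10 toy cells each, 5 features per cell
pts <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
             matrix(rnorm(50, mean = 6), ncol = 5))
group <- rep(1:2, each = 10)
dists <- as.matrix(dist(pts))          # 20 x 20 symmetric distance matrix

# Same call as the "pca" branch: the rotation (loadings) matrix is returned.
# Its rows correspond to cells because the columns of dists are cells.
rot <- prcomp(dists, center = TRUE, scale. = TRUE)$rotation

# Classical MDS on the same distances, for comparison
mds <- cmdscale(dists, k = 2)

# Gap between the group means along the first axis of each embedding
sep_pca <- abs(mean(rot[group == 1, 1]) - mean(rot[group == 2, 1]))
sep_mds <- abs(mean(mds[group == 1, 1]) - mean(mds[group == 2, 1]))
```

In this toy case the first column of `rot` takes opposite signs for the two groups, which is why keeping only the first few columns is enough for a subsequent k-means step.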
then kmeans is run on each of the reduced matrices;
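A rough sketch of this step on toy data: k-means is run on the first d columns of a transformed matrix for several values of d, giving one clustering solution per d. The range of d used here (about 4-7% of the number of cells) is my reading of the SC3 paper and should be treated as an assumption, as should all variable names.

```r
set.seed(11)
# 40 toy cells in two groups, 3 features per cell
pts <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
             matrix(rnorm(60, mean = 5), ncol = 3))
dists <- as.matrix(dist(pts))
rot <- prcomp(dists, center = TRUE, scale. = TRUE)$rotation

n_cells <- nrow(rot)
# Assumed range of d: roughly 4-7% of the number of cells, and at least 1
d_range <- unique(pmax(1, floor(0.04 * n_cells):ceiling(0.07 * n_cells)))

# One k-means run per d; each column of `clusts` is one clustering solution
k <- 2
clusts <- sapply(d_range, function(d) {
  kmeans(rot[, seq_len(d), drop = FALSE], centers = k, nstart = 10)$cluster
})
colnames(clusts) <- paste0("d", d_range)
```

Repeating this over every distance metric and both transformations yields a cells x solutions label matrix, which is the kind of input the consensus step works on.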
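consensus uses this algorithm: Cluster-based Similarity Partitioning Algorithm (CSPA). What is the point of it? The input is the multiple clustering results for each cell, and these are merged into a single consensus.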
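The roxygen description of consensus_matrix above translates almost directly into base R. The sketch below is a plain R stand-in for the compiled `consmx`, with the invented name `cspa_consensus`. Turning the consensus matrix into final labels with `hclust` on `1 - consensus` is an assumption made for illustration; the package may derive its dissimilarity differently.

```r
# Consensus matrix by CSPA: the average of binary co-clustering matrices.
# `clusts` has one row per cell and one clustering solution per column.
cspa_consensus <- function(clusts) {
  n <- nrow(clusts)
  res <- matrix(0, n, n)
  for (j in seq_len(ncol(clusts))) {
    # 1 where two cells share a cluster in solution j, 0 otherwise
    res <- res + outer(clusts[, j], clusts[, j], "==")
  }
  res / ncol(clusts)
}

# Three toy solutions for six cells. Label names differ between solutions,
# which does not matter because only co-membership is used.
clusts <- cbind(
  c(1, 1, 1, 2, 2, 2),
  c(2, 2, 2, 1, 1, 1),
  c(1, 1, 2, 2, 2, 2)
)
cons <- cspa_consensus(clusts)

# Final labels: hierarchical clustering of the consensus matrix, cut at k
hc <- hclust(as.dist(1 - cons), method = "complete")
final <- cutree(hc, k = 2)
```

Here cell 3 is grouped with cells 1 and 2 in two of the three solutions, so its consensus with them is 2/3 and it ends up in their cluster. This is how consensus resolves disagreement between individual k-means runs.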