Minimal guide | GRN | SCENIC | pySCENIC | Installing and using the latest scenicplus

January 24, 2025

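Level 1: single-cell RNA-seq set off the boom in GRN analysis. It boils down to co-expression plus motif enrichment, and the representative tool is pySCENIC.

Level 2: single-cell multi-omics further reduces false positives. The representative tool is scenicplus.

Level 3: our lab's dTAG bulk multi-omics, which gives direct binding evidence and causal effects on expression. This is what a true GRN looks like.

My current understanding: within the full marker list of a cell type or cell state in single-cell data, some TFs must be drivers, and those can be assembled into a GRN. The information is already in the markers; we only need to find those TFs.

Single-cell multi-omics merely adds a fancier layer of evidence, namely motif activity. From what I have seen so far, SOX9 itself already looks quite reliable.

Of course, in the end we have to do KO/KD multi-omics for the specific TF of interest.

So let's run the pipeline first.

Software installation: pySCENIC is significantly faster than the R version.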

https://github.com/aertslab/pySCENIC

/home/zz950/softwares/miniconda3/envs/r4p3/bin/pyscenic -h

The most recent release is 0.12.1 (Nov 21, 2022). Check whether the installed version is the latest. [YES, confirmed to be the latest.]

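Preparing the data

A basic point: correlation depends heavily on context. The correlations you get change completely depending on which cells you include.

I need to build 4 GRNs:

pan-stem cell GRN ["pan" is defined as expressed in more than 30% of the cells];
Lgr5 stem cell GRN;
proliferative stem cell GRN;
revival stem cell GRN;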
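The 30% rule for "pan" can be written down as a small filter. The following is only a sketch in plain Python on a made-up count matrix: the counts, the gene columns, and the helper names `fraction_expressing` and `pan_genes` are all hypothetical, and "expressed" is assumed to mean a non-zero count.

```python
# Toy count matrix: one row per cell, one column per gene (hypothetical values).
genes = ["Sox9", "Lgr5", "Mki67", "Clu"]
counts = [
    [3, 2, 0, 0],
    [1, 0, 4, 0],
    [2, 1, 0, 0],
    [5, 0, 0, 1],
    [1, 0, 2, 0],
]


def fraction_expressing(counts, genes):
    """Return {gene: fraction of cells with a non-zero count}."""
    n_cells = len(counts)
    return {
        gene: sum(1 for row in counts if row[j] > 0) / n_cells
        for j, gene in enumerate(genes)
    }


def pan_genes(counts, genes, min_fraction=0.30):
    """Genes detected in more than `min_fraction` of the cells (assumed cutoff)."""
    fractions = fraction_expressing(counts, genes)
    return [gene for gene in genes if fractions[gene] > min_fraction]


print(fraction_expressing(counts, genes))
print(pan_genes(counts, genes))
```

On real data the same idea would normally be applied to the sparse matrix of the stem-cell subset rather than to nested lists.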

As I currently understand it, negative-control cells must be included, otherwise no correlation emerges. So I will use enterocytes as the negative control.

/home/zz950/softwares/miniconda3/envs/r4p3/bin/pyscenic grn -h
usage: pyscenic grn [-h] [-o OUTPUT] [-t] [-m {genie3,grnboost2}] [--seed SEED] [--num_workers NUM_WORKERS]
                    [--client_or_address CLIENT_OR_ADDRESS] [--cell_id_attribute CELL_ID_ATTRIBUTE] [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                    expression_mtx_fname tfs_fname
 
positional arguments:
  expression_mtx_fname  The name of the file that contains the expression matrix for the single cell experiment. Two file formats are supported:
                        csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells).
  tfs_fname             The name of the file that contains the list of transcription factors (TXT; one TF per line).
 
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, i.e. a table of TF-target genes (CSV).
  -t, --transpose       Transpose the expression matrix (rows=genes x columns=cells).
  -m {genie3,grnboost2}, --method {genie3,grnboost2}
                        The algorithm for gene regulatory network reconstruction (default: grnboost2).
  --seed SEED           Seed value for regressor random state initialization. Applies to both GENIE3 and GRNBoost2. The default is to use a
                        random seed.
 
computation arguments:
  --num_workers NUM_WORKERS
                        The number of workers to use. Only valid if using dask_multiprocessing, custom_multiprocessing or local as mode.
                        (default: 8).
  --client_or_address CLIENT_OR_ADDRESS
                        The client or the IP address of the dask scheduler to use. (Only required of dask_cluster is selected as mode)
 
loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only.
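Based on the help text above, a `pyscenic grn` call can be assembled as follows. This is a sketch that only builds the command and does not run it. The file names `stem_vs_enterocyte.csv` and `adjacencies.csv` and the helper `build_grn_command` are hypothetical; the flags are the ones listed in the help output.

```python
import shlex


def build_grn_command(expression_csv, tfs_txt, output_csv,
                      num_workers=8, seed=None, method="grnboost2"):
    """Assemble an argv list for `pyscenic grn` from flags shown in its help."""
    cmd = ["pyscenic", "grn",
           "--output", output_csv,
           "--method", method,
           "--num_workers", str(num_workers)]
    if seed is not None:
        # Without --seed the regressors use a random seed on every run.
        cmd += ["--seed", str(seed)]
    # Positional arguments come last: expression matrix, then the TF list.
    cmd += [expression_csv, tfs_txt]
    return cmd


cmd = build_grn_command("stem_vs_enterocyte.csv", "allTFs_mm.txt",
                        "adjacencies.csv", seed=42)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Note that a CSV input is expected as cells x genes; if the matrix is stored genes x cells, the help lists `-t` / `--transpose` for that case.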

Download the TF lists

$ wc -l allTFs_hg38.txt
1892 allTFs_hg38.txt

$ wc -l allTFs_mm.txt
1860 allTFs_mm.txt
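A TF list only helps if its symbols match the gene names in the expression matrix, so it is worth checking the overlap before running `grn`. A minimal sketch, assuming the one-symbol-per-line TXT format; the toy file content, the gene names, and both helper functions are made up for the demo.

```python
import os
import tempfile


def read_tf_list(path):
    """Read a TXT file with one TF symbol per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def tfs_in_matrix(tf_names, matrix_genes):
    """Keep TFs that are present in the expression matrix, preserving order."""
    present = set(matrix_genes)
    return [tf for tf in tf_names if tf in present]


# Demo on a throwaway file instead of the real allTFs_*.txt lists.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_TFs.txt")
    with open(path, "w") as handle:
        handle.write("Sox9\nAscl2\n\nHnf4a\n")
    tf_names = read_tf_list(path)

matrix_genes = ["Lgr5", "Sox9", "Hnf4a", "Mki67"]  # hypothetical matrix columns
kept = tfs_in_matrix(tf_names, matrix_genes)
print(f"{len(kept)} of {len(tf_names)} TFs are in the matrix: {kept}")
```

A very low overlap usually points to a species or capitalization mismatch between the list and the matrix.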

$ pyscenic ctx -h
usage: pyscenic ctx [-h] [-o OUTPUT] [-n] [--chunk_size CHUNK_SIZE] [--mode {custom_multiprocessing,dask_multiprocessing,dask_cluster}] [-a]
                    [-t] [--rank_threshold RANK_THRESHOLD] [--auc_threshold AUC_THRESHOLD] [--nes_threshold NES_THRESHOLD]
                    [--min_orthologous_identity MIN_ORTHOLOGOUS_IDENTITY] [--max_similarity_fdr MAX_SIMILARITY_FDR] --annotations_fname
                    ANNOTATIONS_FNAME [--num_workers NUM_WORKERS] [--client_or_address CLIENT_OR_ADDRESS]
                    [--thresholds THRESHOLDS [THRESHOLDS ...]] [--top_n_targets TOP_N_TARGETS [TOP_N_TARGETS ...]]
                    [--top_n_regulators TOP_N_REGULATORS [TOP_N_REGULATORS ...]] [--min_genes MIN_GENES]
                    [--expression_mtx_fname EXPRESSION_MTX_FNAME] [--mask_dropouts] [--cell_id_attribute CELL_ID_ATTRIBUTE]
                    [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                    module_fname database_fname [database_fname ...]
 
positional arguments:
  module_fname          The name of the file that contains the signature or the co-expression modules. The following formats are supported: CSV
                        or TSV (adjacencies), YAML, GMT and DAT (modules)
  database_fname        The name(s) of the regulatory feature databases. Two file formats are supported: feather or db (legacy).
 
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, i.e. a table of enriched motifs and target genes (csv, tsv) or collection of regulons (yaml, gmt,
                        dat, json).
  -n, --no_pruning      Do not perform pruning, i.e. find enriched motifs.
  --chunk_size CHUNK_SIZE
                        The size of the module chunks assigned to a node in the dask graph (default: 100).
  --mode {custom_multiprocessing,dask_multiprocessing,dask_cluster}
                        The mode to be used for computing (default: custom_multiprocessing).
  -a, --all_modules     Included positive and negative regulons in the analysis (default: no, i.e. only positive).
  -t, --transpose       Transpose the expression matrix (rows=genes x columns=cells).
 
motif enrichment arguments:
  --rank_threshold RANK_THRESHOLD
                        The rank threshold used for deriving the target genes of an enriched motif (default: 5000).
  --auc_threshold AUC_THRESHOLD
                        The threshold used for calculating the AUC of a feature as fraction of ranked genes (default: 0.05).
  --nes_threshold NES_THRESHOLD
                        The Normalized Enrichment Score (NES) threshold for finding enriched features (default: 3.0).
 
motif annotation arguments:
  --min_orthologous_identity MIN_ORTHOLOGOUS_IDENTITY
                        Minimum orthologous identity to use when annotating enriched motifs (default: 0.0).
  --max_similarity_fdr MAX_SIMILARITY_FDR
                        Maximum FDR in motif similarity to use when annotating enriched motifs (default: 0.001).
  --annotations_fname ANNOTATIONS_FNAME
                        The name of the file that contains the motif annotations to use.
 
computation arguments:
  --num_workers NUM_WORKERS
                        The number of workers to use. Only valid if using dask_multiprocessing, custom_multiprocessing or local as mode.
                        (default: 64).
  --client_or_address CLIENT_OR_ADDRESS
                        The client or the IP address of the dask scheduler to use. (Only required of dask_cluster is selected as mode)
 
module generation arguments:
  --thresholds THRESHOLDS [THRESHOLDS ...]
                        The first method to create the TF-modules based on the best targets for each transcription factor (default: 0.75 0.90).
  --top_n_targets TOP_N_TARGETS [TOP_N_TARGETS ...]
                        The second method is to select the top targets for a given TF. (default: 50)
  --top_n_regulators TOP_N_REGULATORS [TOP_N_REGULATORS ...]
                        The alternative way to create the TF-modules is to select the best regulators for each gene. (default: 5 10 50)
  --min_genes MIN_GENES
                        The minimum number of genes in a module (default: 20).
  --expression_mtx_fname EXPRESSION_MTX_FNAME
                        The name of the file that contains the expression matrix for the single cell experiment. Two file formats are supported:
                        csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells). (Only required if modules need to be generated)
  --mask_dropouts       If modules need to be generated, this controls whether cell dropouts (cells in which expression of either TF or target
                        gene is 0) are masked when calculating the correlation between a TF-target pair. This affects which target genes are
                        included in the initial modules, and the final pruned regulon (by default only positive regulons are kept (see
                        --all_modules option)). The default value in pySCENIC 0.9.16 and previous versions was to mask dropouts when calculating
                        the correlation; however, all cells are now kept by default, to match the R version.
 
loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only.
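The `--nes_threshold` option (default 3.0) is easier to reason about with a toy calculation. The sketch below assumes the NES of a motif is its AUC standardized against the AUCs of all motifs for the same module; the AUC values are invented, and the choice of population standard deviation is an assumption of this sketch, not a statement about pySCENIC internals.

```python
from statistics import mean, pstdev


def normalized_enrichment_scores(auc_by_motif):
    """Standardize each motif's AUC against all motifs: (AUC - mean) / sd."""
    values = list(auc_by_motif.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {motif: (auc - mu) / sigma for motif, auc in auc_by_motif.items()}


# 20 hypothetical background motifs plus one clearly enriched motif.
aucs = {f"motif_bg{i}": 0.02 if i % 2 == 0 else 0.03 for i in range(20)}
aucs["motif_top"] = 0.10

nes = normalized_enrichment_scores(aucs)
enriched = [motif for motif, score in nes.items() if score >= 3.0]
print(enriched, round(nes["motif_top"], 2))
```

Because the score is relative to the other motifs, the same raw AUC can pass or fail the threshold depending on how the rest of the database behaves for that module.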

$ pyscenic aucell -h
usage: pyscenic aucell [-h] [-o OUTPUT] [-t] [-w] [--num_workers NUM_WORKERS] [--seed SEED] [--rank_threshold RANK_THRESHOLD]
                       [--auc_threshold AUC_THRESHOLD] [--nes_threshold NES_THRESHOLD] [--cell_id_attribute CELL_ID_ATTRIBUTE]
                       [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                       expression_mtx_fname signatures_fname
 
positional arguments:
  expression_mtx_fname  The name of the file that contains the expression matrix for the single cell experiment. Two file formats are supported:
                        csv (rows=cells x columns=genes) or loom (rows=genes x columns=cells).
  signatures_fname      The name of the file that contains the gene signatures. Three file formats are supported: gmt, yaml or dat (pickle).
 
optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, a matrix of AUC values. Two file formats are supported: csv or loom. If loom file is specified the
                        loom file while contain the original expression matrix and the calculated AUC values as extra column attributes.
  -t, --transpose       Transpose the expression matrix if supplied as csv (rows=genes x columns=cells).
  -w, --weights         Use weights associated with genes in recovery analysis. Is only relevant when gene signatures are supplied as json
                        format.
  --num_workers NUM_WORKERS
                        The number of workers to use (default: 64).
  --seed SEED           Seed for the expression matrix ranking step. The default is to use a random seed.
 
motif enrichment arguments:
  --rank_threshold RANK_THRESHOLD
                        The rank threshold used for deriving the target genes of an enriched motif (default: 5000).
  --auc_threshold AUC_THRESHOLD
                        The threshold used for calculating the AUC of a feature as fraction of ranked genes (default: 0.05).
  --nes_threshold NES_THRESHOLD
                        The Normalized Enrichment Score (NES) threshold for finding enriched features (default: 3.0).
 
loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix. Currently applies to the grn inference step only.
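For intuition about what `--auc_threshold` does in the `aucell` step, here is a simplified recovery-curve AUC for a single cell. It is an illustration of the idea (rank genes by expression, count signature genes among the top fraction), not pySCENIC's implementation. The toy cells, the signature, and the function `recovery_auc` are hypothetical, and the demo uses a 0.25 cutoff only because the example has just 20 genes.

```python
def recovery_auc(expression, signature, auc_threshold=0.05):
    """Simplified recovery-curve AUC for one cell.

    expression: {gene: value} for a single cell
    signature:  set of gene names (for example, one regulon)
    """
    # Rank genes from highest to lowest expression.
    ranking = sorted(expression, key=expression.get, reverse=True)
    cutoff = max(1, round(auc_threshold * len(ranking)))

    hits = 0
    area = 0
    for gene in ranking[:cutoff]:
        if gene in signature:
            hits += 1
        area += hits  # height of the recovery curve at this rank

    # Best case: every signature gene sits at the very top of the ranking.
    k = min(cutoff, sum(1 for gene in signature if gene in expression))
    best = sum(min(rank + 1, k) for rank in range(cutoff))
    return area / best if best else 0.0


signature = {"g0", "g1", "g2"}
cell_on = {f"g{i}": 20 - i for i in range(20)}   # signature genes rank first
cell_off = {f"g{i}": i for i in range(20)}       # signature genes rank last

print(recovery_auc(cell_on, signature, auc_threshold=0.25))
print(recovery_auc(cell_off, signature, auc_threshold=0.25))
```

The score only looks at the top of each cell's ranking, so it does not depend on the absolute scale of the expression values.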

http://localhost:17449/lab/tree/projects/CRC_Atlas/2023_Qin_CSC_fetal_Cell/scenicplus.ipynb

# ERROR: Package 'scenicplus' requires a different Python: 3.11.11 not in '<=3.11.8,>=3.8'
conda create --name scenicplus3 python=3.11.8
# fail for version 1 & 2
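The error above comes from the Python constraint `<=3.11.8,>=3.8` quoted in the message. A quick way to check an interpreter against it before installing is sketched below; the function name is made up, and plain tuple comparison is used as an approximation of the real version-specifier rules, which is adequate for ordinary `major.minor.micro` versions.

```python
import sys


def satisfies_scenicplus_python(version):
    """True if a (major, minor, micro) tuple satisfies '>=3.8,<=3.11.8'."""
    return (3, 8) <= tuple(version) <= (3, 11, 8)


print(satisfies_scenicplus_python((3, 11, 11)))  # the version from the error message
print(satisfies_scenicplus_python((3, 11, 8)))   # the version pinned in conda create
print("current interpreter ok:", satisfies_scenicplus_python(sys.version_info[:3]))
```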

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

## ModuleNotFoundError: No module named 'ipykernel'
# conda install ipykernel
pip install ipykernel

ipython kernel install --name "scenicplus3" --user