Minimal | GRN | SCENIC | pySCENIC | Installing and using the latest scenicplus
January 24, 2025
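Level 1: single-cell RNA-seq set off the boom in GRN analysis. It boils down to co-expression plus motif enrichment, and the representative tool is pySCENIC.
Level 2: single-cell multiome data further reduces false positives; the representative tool is scenicplus.
Level 3: our lab's dTAG bulk multi-omics gives direct binding evidence and causal effects on expression. That is what a real GRN looks like.
My current understanding: within the full marker set of a cell type or cell state in single-cell data, some TFs must be drivers, and those can be assembled into a GRN. The information is already in the markers; we only need to find those TFs.
Single-cell multiome essentially adds one fancier line of evidence, motif activity. So far SOX9 on its own already looks quite reliable.
In the end, of course, one still has to do multi-omics on a KO or KD of the specific TF of interest.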
So let's run through the pipeline first.
Software installation: pySCENIC is markedly faster than the R version.
https://github.com/aertslab/pySCENIC
/home/zz950/softwares/miniconda3/envs/r4p3/bin/pyscenic -h
The most recently released version is 0.12.1 (Nov 21, 2022); check whether the installed one is the latest. [YES, confirmed to be the latest]
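Preparing the data
A basic fact: correlation depends heavily on context. Which cells you include completely changes the correlations you get.
I need to build 4 GRNs:
- a pan-stem cell GRN [here "pan" means expressed in at least 30% of the cells];
- an Lgr5 stem cell GRN;
- a proliferative stem cell GRN;
- a revival stem cell GRN.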
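The "pan" criterion in the list above can be sketched as a simple detection-rate filter. This is a minimal sketch, assuming a cells x genes pandas table and assuming "expressed" means a count greater than zero and "at least 30%" is inclusive; the function name and the toy gene names are hypothetical, not part of pySCENIC.

```python
import pandas as pd


def pan_expressed_genes(expr: pd.DataFrame, min_fraction: float = 0.30) -> list:
    """Return genes detected (count > 0) in at least `min_fraction` of cells.

    `expr` is assumed to be a cells x genes table of raw or normalized counts.
    """
    # Fraction of cells in which each gene has a non-zero value.
    detected_fraction = (expr > 0).mean(axis=0)
    return detected_fraction[detected_fraction >= min_fraction].index.tolist()


# Toy matrix: 10 cells x 3 hypothetical genes.
toy = pd.DataFrame({
    "GeneA": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # detected in 10% of cells
    "GeneB": [2, 1, 3, 1, 0, 0, 0, 0, 0, 0],  # detected in 40% of cells
    "GeneC": [5, 4, 6, 5, 4, 6, 5, 4, 6, 5],  # detected in 100% of cells
})
pan_genes = pan_expressed_genes(toy)
print(pan_genes)
```

The threshold is a parameter so the 30% cutoff can be tightened or relaxed per stem cell population.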
As far as I currently understand, negative-control cells are required, otherwise no correlation emerges. So I'll use enterocytes as the negative control.
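The negative-control point can be illustrated with a toy example. This is a sketch under stated assumptions: Pearson correlation stands in for co-expression (GRNBoost2 itself uses regression-based importances, not Pearson), and the cell labels, gene names, and helper function are all hypothetical. Within the stem cells alone the TF and its target are both uniformly high, so they do not co-vary; adding enterocytes, where both are low, makes the relationship visible.

```python
import pandas as pd


def subset_with_controls(expr, labels, target_type, control_type="enterocyte"):
    """Keep only cells labelled `target_type` or `control_type` (cells x genes input)."""
    labels = labels.reindex(expr.index)  # align labels to the matrix rows
    mask = labels.isin([target_type, control_type]).to_numpy()
    return expr.loc[mask]


cells = [f"cell{i}" for i in range(10)]
labels = pd.Series(
    ["Lgr5_stem"] * 4 + ["enterocyte"] * 4 + ["goblet"] * 2, index=cells
)
expr = pd.DataFrame(
    {
        "TF_A": [5, 6, 5, 6, 0, 1, 0, 1, 3, 3],
        "Target_B": [6, 6, 5, 5, 1, 1, 0, 0, 0, 6],
    },
    index=cells,
)

stem_only = expr.loc[(labels == "Lgr5_stem").to_numpy()]
with_controls = subset_with_controls(expr, labels, "Lgr5_stem")

r_stem_only = stem_only["TF_A"].corr(stem_only["Target_B"])
r_with_controls = with_controls["TF_A"].corr(with_controls["Target_B"])
print(f"stem cells only: r = {r_stem_only:.2f}")
print(f"stem cells + enterocytes: r = {r_with_controls:.2f}")
```

In this toy data the stem-only correlation is zero while the stem-plus-enterocyte correlation is close to 1, which is the context dependence described above.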
/home/zz950/softwares/miniconda3/envs/r4p3/bin/pyscenic grn -h
usage: pyscenic grn [-h] [-o OUTPUT] [-t] [-m {genie3,grnboost2}]
                    [--seed SEED] [--num_workers NUM_WORKERS]
                    [--client_or_address CLIENT_OR_ADDRESS]
                    [--cell_id_attribute CELL_ID_ATTRIBUTE]
                    [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                    expression_mtx_fname tfs_fname

positional arguments:
  expression_mtx_fname  The name of the file that contains the expression
                        matrix for the single cell experiment. Two file
                        formats are supported: csv (rows=cells x
                        columns=genes) or loom (rows=genes x columns=cells).
  tfs_fname             The name of the file that contains the list of
                        transcription factors (TXT; one TF per line).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, i.e. a table of TF-target genes
                        (CSV).
  -t, --transpose       Transpose the expression matrix (rows=genes x
                        columns=cells).
  -m {genie3,grnboost2}, --method {genie3,grnboost2}
                        The algorithm for gene regulatory network
                        reconstruction (default: grnboost2).
  --seed SEED           Seed value for regressor random state initialization.
                        Applies to both GENIE3 and GRNBoost2. The default is
                        to use a random seed.

computation arguments:
  --num_workers NUM_WORKERS
                        The number of workers to use. Only valid if using
                        dask_multiprocessing, custom_multiprocessing or local
                        as mode. (default: 8).
  --client_or_address CLIENT_OR_ADDRESS
                        The client or the IP address of the dask scheduler to
                        use. (Only required of dask_cluster is selected as
                        mode)

loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the
                        identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene
                        symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix.
                        Currently applies to the grn inference step only.
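The `grn` step described by this help text could be driven from Python roughly as follows. This is a minimal sketch that only assembles the argument list from flags shown in the help output (`-o`, `-m`, `--seed`, `--num_workers`); the helper names and the input/output file names are hypothetical placeholders, and `pyscenic` is assumed to be on `PATH` (otherwise pass the full path to the executable).

```python
import subprocess


def build_grn_command(expr_path, tfs_path, output_path,
                      method="grnboost2", num_workers=8, seed=None,
                      executable="pyscenic"):
    """Assemble the argument list for `pyscenic grn` without running it."""
    cmd = [executable, "grn",
           "-o", output_path,
           "-m", method,
           "--num_workers", str(num_workers)]
    if seed is not None:
        # A fixed seed makes the regressor initialization reproducible.
        cmd += ["--seed", str(seed)]
    # The two positional arguments: expression matrix, then the TF list.
    cmd += [expr_path, tfs_path]
    return cmd


def run(cmd):
    """Run the command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


# Hypothetical file names; replace with the real matrix, TF list and output.
grn_cmd = build_grn_command("stem_vs_enterocyte.loom", "allTFs_mm.txt",
                            "adjacencies.csv", seed=42)
print(" ".join(grn_cmd))
# run(grn_cmd)  # uncomment once the input files exist
```

Building the list separately from running it makes it easy to log the exact command used for each of the four GRNs.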
Download the TF lists
$ wc -l allTFs_hg38.txt
1892 allTFs_hg38.txt
$ wc -l allTFs_mm.txt
1860 allTFs_mm.txt
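Before running `grn`, it can help to confirm that the TF list actually overlaps the gene symbols in the expression matrix: human symbols are upper case (SOX9) while mouse symbols are capitalized (Sox9), so a list from the wrong species would match almost nothing. The sketch below assumes the one-TF-per-line format described in the help text; the function names and the toy symbols are illustrative only.

```python
import tempfile
from pathlib import Path


def load_tf_list(path):
    """Read a TF list with one gene symbol per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def tfs_present(tfs, matrix_genes):
    """Split TFs into those found in the matrix and those missing from it."""
    genes = set(matrix_genes)
    found = [tf for tf in tfs if tf in genes]
    missing = [tf for tf in tfs if tf not in genes]
    return found, missing


# Toy demonstration with a temporary file standing in for allTFs_mm.txt.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_TFs.txt"
    toy_path.write_text("Sox9\nAscl2\n\nAtoh1\n")
    toy_tfs = load_tf_list(toy_path)

found, missing = tfs_present(toy_tfs, ["Sox9", "Lgr5", "Atoh1", "Clu"])
print(f"{len(found)} of {len(toy_tfs)} TFs found; missing: {missing}")
```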
$ pyscenic ctx -h
usage: pyscenic ctx [-h] [-o OUTPUT] [-n] [--chunk_size CHUNK_SIZE]
                    [--mode {custom_multiprocessing,dask_multiprocessing,dask_cluster}]
                    [-a] [-t] [--rank_threshold RANK_THRESHOLD]
                    [--auc_threshold AUC_THRESHOLD]
                    [--nes_threshold NES_THRESHOLD]
                    [--min_orthologous_identity MIN_ORTHOLOGOUS_IDENTITY]
                    [--max_similarity_fdr MAX_SIMILARITY_FDR]
                    --annotations_fname ANNOTATIONS_FNAME
                    [--num_workers NUM_WORKERS]
                    [--client_or_address CLIENT_OR_ADDRESS]
                    [--thresholds THRESHOLDS [THRESHOLDS ...]]
                    [--top_n_targets TOP_N_TARGETS [TOP_N_TARGETS ...]]
                    [--top_n_regulators TOP_N_REGULATORS [TOP_N_REGULATORS ...]]
                    [--min_genes MIN_GENES]
                    [--expression_mtx_fname EXPRESSION_MTX_FNAME]
                    [--mask_dropouts]
                    [--cell_id_attribute CELL_ID_ATTRIBUTE]
                    [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                    module_fname database_fname [database_fname ...]

positional arguments:
  module_fname          The name of the file that contains the signature or
                        the co-expression modules. The following formats are
                        supported: CSV or TSV (adjacencies), YAML, GMT and DAT
                        (modules)
  database_fname        The name(s) of the regulatory feature databases. Two
                        file formats are supported: feather or db (legacy).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, i.e. a table of enriched motifs
                        and target genes (csv, tsv) or collection of regulons
                        (yaml, gmt, dat, json).
  -n, --no_pruning      Do not perform pruning, i.e. find enriched motifs.
  --chunk_size CHUNK_SIZE
                        The size of the module chunks assigned to a node in
                        the dask graph (default: 100).
  --mode {custom_multiprocessing,dask_multiprocessing,dask_cluster}
                        The mode to be used for computing (default:
                        custom_multiprocessing).
  -a, --all_modules     Included positive and negative regulons in the
                        analysis (default: no, i.e. only positive).
  -t, --transpose       Transpose the expression matrix (rows=genes x
                        columns=cells).

motif enrichment arguments:
  --rank_threshold RANK_THRESHOLD
                        The rank threshold used for deriving the target genes
                        of an enriched motif (default: 5000).
  --auc_threshold AUC_THRESHOLD
                        The threshold used for calculating the AUC of a
                        feature as fraction of ranked genes (default: 0.05).
  --nes_threshold NES_THRESHOLD
                        The Normalized Enrichment Score (NES) threshold for
                        finding enriched features (default: 3.0).

motif annotation arguments:
  --min_orthologous_identity MIN_ORTHOLOGOUS_IDENTITY
                        Minimum orthologous identity to use when annotating
                        enriched motifs (default: 0.0).
  --max_similarity_fdr MAX_SIMILARITY_FDR
                        Maximum FDR in motif similarity to use when annotating
                        enriched motifs (default: 0.001).
  --annotations_fname ANNOTATIONS_FNAME
                        The name of the file that contains the motif
                        annotations to use.

computation arguments:
  --num_workers NUM_WORKERS
                        The number of workers to use. Only valid if using
                        dask_multiprocessing, custom_multiprocessing or local
                        as mode. (default: 64).
  --client_or_address CLIENT_OR_ADDRESS
                        The client or the IP address of the dask scheduler to
                        use. (Only required of dask_cluster is selected as
                        mode)

module generation arguments:
  --thresholds THRESHOLDS [THRESHOLDS ...]
                        The first method to create the TF-modules based on the
                        best targets for each transcription factor (default:
                        0.75 0.90).
  --top_n_targets TOP_N_TARGETS [TOP_N_TARGETS ...]
                        The second method is to select the top targets for a
                        given TF. (default: 50)
  --top_n_regulators TOP_N_REGULATORS [TOP_N_REGULATORS ...]
                        The alternative way to create the TF-modules is to
                        select the best regulators for each gene. (default: 5
                        10 50)
  --min_genes MIN_GENES
                        The minimum number of genes in a module (default: 20).
  --expression_mtx_fname EXPRESSION_MTX_FNAME
                        The name of the file that contains the expression
                        matrix for the single cell experiment. Two file
                        formats are supported: csv (rows=cells x
                        columns=genes) or loom (rows=genes x columns=cells).
                        (Only required if modules need to be generated)
  --mask_dropouts       If modules need to be generated, this controls whether
                        cell dropouts (cells in which expression of either TF
                        or target gene is 0) are masked when calculating the
                        correlation between a TF-target pair. This affects
                        which target genes are included in the initial
                        modules, and the final pruned regulon (by default only
                        positive regulons are kept (see --all_modules
                        option)). The default value in pySCENIC 0.9.16 and
                        previous versions was to mask dropouts when
                        calculating the correlation; however, all cells are
                        now kept by default, to match the R version.

loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the
                        identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene
                        symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix.
                        Currently applies to the grn inference step only.
$ pyscenic aucell -h
usage: pyscenic aucell [-h] [-o OUTPUT] [-t] [-w] [--num_workers NUM_WORKERS]
                       [--seed SEED] [--rank_threshold RANK_THRESHOLD]
                       [--auc_threshold AUC_THRESHOLD]
                       [--nes_threshold NES_THRESHOLD]
                       [--cell_id_attribute CELL_ID_ATTRIBUTE]
                       [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                       expression_mtx_fname signatures_fname

positional arguments:
  expression_mtx_fname  The name of the file that contains the expression
                        matrix for the single cell experiment. Two file
                        formats are supported: csv (rows=cells x
                        columns=genes) or loom (rows=genes x columns=cells).
  signatures_fname      The name of the file that contains the gene
                        signatures. Three file formats are supported: gmt,
                        yaml or dat (pickle).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream, a matrix of AUC values. Two file
                        formats are supported: csv or loom. If loom file is
                        specified the loom file while contain the original
                        expression matrix and the calculated AUC values as
                        extra column attributes.
  -t, --transpose       Transpose the expression matrix if supplied as csv
                        (rows=genes x columns=cells).
  -w, --weights         Use weights associated with genes in recovery
                        analysis. Is only relevant when gene signatures are
                        supplied as json format.
  --num_workers NUM_WORKERS
                        The number of workers to use (default: 64).
  --seed SEED           Seed for the expression matrix ranking step. The
                        default is to use a random seed.

motif enrichment arguments:
  --rank_threshold RANK_THRESHOLD
                        The rank threshold used for deriving the target genes
                        of an enriched motif (default: 5000).
  --auc_threshold AUC_THRESHOLD
                        The threshold used for calculating the AUC of a
                        feature as fraction of ranked genes (default: 0.05).
  --nes_threshold NES_THRESHOLD
                        The Normalized Enrichment Score (NES) threshold for
                        finding enriched features (default: 3.0).

loom file arguments:
  --cell_id_attribute CELL_ID_ATTRIBUTE
                        The name of the column attribute that specifies the
                        identifiers of the cells in the loom file.
  --gene_attribute GENE_ATTRIBUTE
                        The name of the row attribute that specifies the gene
                        symbols in the loom file.
  --sparse              If set, load the expression data as a sparse matrix.
                        Currently applies to the grn inference step only.
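The `ctx` and `aucell` steps from the two help texts above can be chained in the same style. This is a sketch, not a tested pipeline: it uses only flags that appear in the help output (`--annotations_fname`, `--expression_mtx_fname`, `-o`, `--num_workers`), and every file name is a hypothetical placeholder. The regulon file is given a `.gmt` extension only because the help text lists gmt among both the `ctx` output formats and the `aucell` signature formats.

```python
def build_ctx_command(adjacencies, databases, annotations, expr_path,
                      output_path, num_workers=8, executable="pyscenic"):
    """Assemble `pyscenic ctx`: prune co-expression modules by motif enrichment."""
    return [executable, "ctx",
            adjacencies, *databases,  # module file, then one or more databases
            "--annotations_fname", annotations,
            "--expression_mtx_fname", expr_path,
            "-o", output_path,
            "--num_workers", str(num_workers)]


def build_aucell_command(expr_path, signatures, output_path,
                         num_workers=8, executable="pyscenic"):
    """Assemble `pyscenic aucell`: score regulon activity per cell."""
    return [executable, "aucell",
            expr_path, signatures,
            "-o", output_path,
            "--num_workers", str(num_workers)]


# Hypothetical file names throughout.
ctx_cmd = build_ctx_command(
    adjacencies="adjacencies.csv",
    databases=["rankings.feather"],
    annotations="motif_annotations.tbl",
    expr_path="stem_vs_enterocyte.loom",
    output_path="regulons.gmt",
)
aucell_cmd = build_aucell_command(
    "stem_vs_enterocyte.loom", "regulons.gmt", "auc_matrix.csv"
)
for cmd in (ctx_cmd, aucell_cmd):
    print(" ".join(cmd))
```

Note that the help output reports a default of 64 workers for `ctx` and `aucell`; passing `--num_workers` explicitly avoids oversubscribing a smaller machine.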
http://localhost:17449/lab/tree/projects/CRC_Atlas/2023_Qin_CSC_fetal_Cell/scenicplus.ipynb
# ERROR: Package 'scenicplus' requires a different Python: 3.11.11 not in '<=3.11.8,>=3.8'
conda create --name scenicplus3 python=3.11.8
# fail for version 1 & 2
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
## ModuleNotFoundError: No module named 'ipykernel'
# conda install ipykernel
pip install ipykernel
ipython kernel install --name "scenicplus3" --user
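The pip error above comes from the version constraint `<=3.11.8,>=3.8`: Python 3.11.11 is newer than the upper bound, which is why the environment is pinned to 3.11.8. A minimal sketch of that check, using plain tuple comparison rather than a full PEP 440 parser (the function names are illustrative):

```python
import sys


def parse_version(text):
    """Turn '3.11.8' into (3, 11, 8), padding short versions with zeros."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def python_satisfies(version, lower="3.8", upper="3.11.8"):
    """True if lower <= version <= upper, mirroring '<=3.11.8,>=3.8'."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


current = ".".join(str(n) for n in sys.version_info[:3])
print(f"Python {current} satisfies the constraint: {python_satisfies(current)}")
```

Running this inside a candidate environment before `pip install` shows immediately whether the interpreter falls in the accepted range.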