CRISPR Screen | 基因编辑筛选靶点 | MAGeCKFlute
工具篇:MAGeCKFlute - Integrative analysis pipeline for pooled CRISPR functional genetic screens
安装折腾了几个小时,debug核心,根据关键提示Google,筛选找到答案即可。
error:
Error: package or namespace load failed for ‘depmap’: .onLoad failed in loadNamespace() for 'depmap', details: call: (function (cond) error: error in evaluating the argument 'x' in selecting a method for function 'query': Failed to collect lazy table. Caused by error in `db_collect()`: ! Arguments in `...` must be used. ✖ Problematic argument: • ..1 = Inf ℹ Did you misspell an argument name? Error: loading failed Execution halted ERROR: loading failed
Solved:
downgrading dbplyr solves this issue
devtools::install_version("dbplyr", version = "2.3.4")
系统性的学习一下CRISPR screen的数据分析
前篇:fastq demultiplexing - 确实得到了比较靠谱的fastq文件
这次比较麻烦,需要从Undetermined的fastq开始,问题就是index是RC,当时没有考虑到。
自己做有点麻烦,如果直接用最短的index,会出现很多假阳性,所以必须extend。
index在read的header里,因为是dual index,所以需要准备index fastq,然后用工具提取即可,一晚上能搞完80G+80G的paired fastqs。
fastq非常dirty,需要先trim一下,fastp一下,80G只剩下65G。【完全可以不做】下面的方法花里胡哨,其实没啥用。
Find a tar file in the above location. It has 12 samples from HEK293T and an excel with sample information.
I trimmed the reads using fastp to remove adapters, poly-G tails and low-quality bases. I used PEAR to merge mate pairs to make a single read based on default overlap parameters.
The fastq.gz files you have now, treat them as single end read files. I believe you have the inzolia guide file for annotation, I shared the simple csv for MAGECX in the same folder. Will keep HT29 data in the same folder once I am done preprocessing it.
【可以略过,没必要做merge,知道有这么个工具就行】
其次就是merge,因为测序长度可能超过了insert size,所以需要寻找overlap,然后merge,这个PEAR可以做。【bbmerge.sh也能做,但比较旧】
R1 --------------------------->
<------------------------- R2
但PEAR会剩下很多unassembled的reads,这时候就用BBtools合并一下,reformat.sh,就是粗暴的拼接。【reformat就是简单的cat,并不是RC后paste】
reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=sample.fq.gz
所以强制改了PEAR的参数,使用所有的reads,这样就不会有unassembled的reads了。
pear -n 0 -e -p 1.0 -v 0 -j 20 -f merged.fq.unassembled.forward.fastq -r merged.fq.unassembled.reverse.fastq -o unassembled
搞完之后,就可以跑MAGeCK的pipeline了。
实操分享-使用MAGeCK分析Bulk CRISPR Screen数据
conda create -c bioconda -n crispr_screen mageck conda activate crispr_screen conda update mageck
可以直接用paired fastq
mageck count -l library.txt -n prefix --sample-label A,B,C,D --fastq A_1.fq.gz B_1.fq.gz C_1.fq.gz D_1.fq.gz --fastq-2 A_2.fq.gz B_2.fq.gz C_2.fq.gz D_2.fq.gz
首先要准备cas12_library_all.csv文件
这次的cas12 lib比较特殊,是四个20 bp的sgRNA被拼接成了一个sgRNA,我试了用bowtie2和hisat2去比对到全长sgRNA,基本是0比对。
conda install bioinfo::hisat conda install bioconda::hisat2 hisat2-build ../library_ref.fasta gRNA_ref hisat2 -p 15 -t --very-sensitive \ -x /home/zz950/projects/demultiplexing/demultiplexed/sgRNA_ref/gRNA_ref \ -1 fastq/D7_1uM_1_1.fq \ -2 fastq/D7_1uM_1_2.fq \ -S D7_1uM_1.sam 2> D7_1uM_1.GenomeMapStat.xls bowtie2-build ../bowtie2_ref.fasta gRNA_ref bowtie2 -x /home/zz950/projects/demultiplexing/demultiplexed/sgRNA_ref_bowtie2/gRNA_ref -1 fastq/D7_1uM_1_1.fq -2 fastq/D7_1uM_1_2.fq -p 15 -S D7_1uM_1.sam
所以不得不拆分成四个sgRNA,然后继续用mageck来分析
mageck说是可以用fastq2,其实基本所有的信息都在fastq1里面,用了paired fastq反而会报错。
我们的样本
D7_1uM_1
D7_1uM_2
D7_DMSO_1
D7_DMSO_2
D2_DMSO_1
D2_DMSO_2
D2_0.1uM_1
D2_0.1uM_2
Plasmid
一行命令搞定:
mageck count -l ../cas12_library_all.csv -n results/all.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_1_1.fq D7_1uM_2_1.fq D7_DMSO_1_1.fq D7_DMSO_2_1.fq D2_DMSO_1_1.fq D2_DMSO_2_1.fq D2_0.1uM_1_1.fq D2_0.1uM_2_1.fq Plasmid_1.fq
# 用了会报错 --fastq-2 D7_1uM_1_2.fq D7_1uM_2_2.fq D7_DMSO_1_2.fq D7_DMSO_2_2.fq D2_DMSO_1_2.fq D2_DMSO_2_2.fq D2_0.1uM_1_2.fq D2_0.1uM_2_2.fq Plasmid_2.fq
对每个样本单独做也行
mageck count -l ../cas12_library_all.csv -n results/D7_1uM_1 --sample-label D7_1uM_1 --fastq D7_1uM_1_1.fq mageck count -l ../cas12_library_all.csv -n results/D7_1uM_2 --sample-label D7_1uM_2 --fastq D7_1uM_2_1.fq --fastq-2 D7_1uM_2_2.fq mageck count -l ../cas12_library_all.csv -n results/D7_DMSO_1 --sample-label D7_DMSO_1 --fastq D7_DMSO_1_1.fq --fastq-2 D7_DMSO_1_2.fq mageck count -l ../cas12_library_all.csv -n results/D7_DMSO_2 --sample-label D7_DMSO_2 --fastq D7_DMSO_2_1.fq mageck count -l ../cas12_library_all.csv -n results/D2_DMSO_1 --sample-label D2_DMSO_1 --fastq D2_DMSO_1_1.fq --fastq-2 D2_DMSO_1_2.fq mageck count -l ../cas12_library_all.csv -n results/D2_DMSO_2 --sample-label D2_DMSO_2 --fastq D2_DMSO_2_1.fq mageck count -l ../cas12_library_all.csv -n results/D2_0.1uM_1 --sample-label D2_0.1uM_1 --fastq D2_0.1uM_1_1.fq mageck count -l ../cas12_library_all.csv -n results/D2_0.1uM_2 --sample-label D2_0.1uM_2 --fastq D2_0.1uM_2_1.fq mageck count -l ../cas12_library_all.csv -n results/Plasmid --sample-label Plasmid --fastq Plasmid_1.fq --fastq-2 Plasmid_2.fq
OK,得到count matrix了,可以做下游分析了。
讨论之后,发现必须要用top2 gRNA作为construct,来计数。count也没有减少多少。
mageck count -l ../cas12_library_top2.csv -n results/all.top2.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_1_1.fq D7_1uM_2_1.fq D7_DMSO_1_1.fq D7_DMSO_2_1.fq D2_DMSO_1_1.fq D2_DMSO_2_1.fq D2_0.1uM_1_1.fq D2_0.1uM_2_1.fq Plasmid_1.fq
mageck test -k all.top2.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp mageck test -k all.top2.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp
还有一种对比看时间效应:sample (D2 or D7) vs plasmid(Day 0)
但因为我们的plasmid数据不行,暂时就不对比了。
下游分析代码:projects/demultiplexing/demultiplexed/cas12_SOX9_dTAG_CRISPR_screen.ipynb
从novogene得到了新的 demultiplexing 数据,对比了一下,数据量差不多,但是还是以他们为金标准,重新分析一遍吧,也挺快的。
2057 zcat Inzolia_Plasmid/Inzolia_Plasmid_1.fq.gz Inzolia_Plasmid_Re/Inzolia_Plasmid_Re_1.fq.gz > Inzolia_Plasmid_1.fq 2059 zcat Inzolia_Plasmid/Inzolia_Plasmid_2.fq.gz Inzolia_Plasmid_Re/Inzolia_Plasmid_Re_2.fq.gz > Inzolia_Plasmid_2.fq 2061 zcat D7_DMSO_R2/D7_DMSO_R2_1.fq.gz D7_DMSO_R2_Re/D7_DMSO_R2_Re_1.fq.gz > D7_DMSO_R2_1.fq 2062 zcat D7_DMSO_R2/D7_DMSO_R2_2.fq.gz D7_DMSO_R2_Re/D7_DMSO_R2_Re_2.fq.gz > D7_DMSO_R2_2.fq 2064 zcat D2_DMSO_R1/D2_DMSO_R1_1.fq.gz D2_DMSO_R1_Re/D2_DMSO_R1_Re_1.fq.gz > D2_DMSO_R1_1.fq 2065 zcat D2_DMSO_R1/D2_DMSO_R1_2.fq.gz D2_DMSO_R1_Re/D2_DMSO_R1_Re_2.fq.gz > D2_DMSO_R1_2.fq 2066 zcat D7_1uM_R1/D7_1uM_R1_1.fq.gz D7_1uM_R1_Re/D7_1uM_R1_Re_1.fq.gz > D7_1uM_R1_1.fq 2067 zcat D7_1uM_R1/D7_1uM_R1_2.fq.gz D7_1uM_R1_Re/D7_1uM_R1_Re_2.fq.gz > D7_1uM_R1_2.fq 2068 zcat D2_100nM_R2/D2_100nM_R2_1.fq.gz D2_100nM_R2_Re/D2_100nM_R2_Re_1.fq.gz > D2_100nM_R2_1.fq 2069 zcat D2_100nM_R2/D2_100nM_R2_2.fq.gz D2_100nM_R2_Re/D2_100nM_R2_Re_2.fq.gz > D2_100nM_R2_2.fq 2070 zcat D7_DMSO_R1/D7_DMSO_R1_1.fq.gz D7_DMSO_R1_Re/D7_DMSO_R1_Re_1.fq.gz > D7_DMSO_R1_1.fq 2071 zcat D7_DMSO_R1/D7_DMSO_R1_2.fq.gz D7_DMSO_R1_Re/D7_DMSO_R1_Re_2.fq.gz > D7_DMSO_R1_2.fq 2072 zcat D2_DMSO_R2/D2_DMSO_R2_1.fq.gz D2_DMSO_R2_Re/D2_DMSO_R2_Re_1.fq.gz > D2_DMSO_R2_1.fq 2073 zcat D2_DMSO_R2/D2_DMSO_R2_2.fq.gz D2_DMSO_R2_Re/D2_DMSO_R2_Re_2.fq.gz > D2_DMSO_R2_2.fq 2074 zcat D7_1uM_R2/D7_1uM_R2_1.fq.gz D7_1uM_R2_Re/D7_1uM_R2_Re_1.fq.gz > D7_1uM_R2_1.fq 2075 zcat D7_1uM_R2/D7_1uM_R2_2.fq.gz D7_1uM_R2_Re/D7_1uM_R2_Re_2.fq.gz > D7_1uM_R2_2.fq 2076 zcat D2_100nM_R1/D2_100nM_R1_1.fq.gz D2_100nM_R1_Re/D2_100nM_R1_Re_1.fq.gz > D2_100nM_R1_1.fq 2077 zcat D2_100nM_R1/D2_100nM_R1_2.fq.gz D2_100nM_R1_Re/D2_100nM_R1_Re_2.fq.gz > D2_100nM_R1_2.fq
mageck count -l ../../cas12_library_top2.csv -n results/all.2gRNA.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2 --fastq D7_1uM_R1_1.fq D7_1uM_R2_1.fq D7_DMSO_R1_1.fq D7_DMSO_R2_1.fq D2_DMSO_R1_1.fq D2_DMSO_R2_1.fq D2_100nM_R1_1.fq D2_100nM_R2_1.fq # --fastq-2 D7_1uM_R1_2.fq D7_1uM_R2_2.fq D7_DMSO_R1_2.fq D7_DMSO_R2_2.fq D2_DMSO_R1_2.fq D2_DMSO_R2_2.fq D2_100nM_R1_2.fq D2_100nM_R2_2.fq mageck test -k all.2gRNA.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp mageck test -k all.2gRNA.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp
mageck count -l ../../cas12_library_top2.csv -n results/all.2gRNA.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_R1_1.fq D7_1uM_R2_1.fq D7_DMSO_R1_1.fq D7_DMSO_R2_1.fq D2_DMSO_R1_1.fq D2_DMSO_R2_1.fq D2_100nM_R1_1.fq D2_100nM_R2_1.fq Inzolia_Plasmid_1.fq # --fastq-2 D7_1uM_R1_2.fq D7_1uM_R2_2.fq D7_DMSO_R1_2.fq D7_DMSO_R2_2.fq D2_DMSO_R1_2.fq D2_DMSO_R2_2.fq D2_100nM_R1_2.fq D2_100nM_R2_2.fq
CRISRP共同量测序分为阳性筛选和阴性筛选。
阳性筛选指施加一定的筛选压力,经文库扰动后野生型细胞致死,仅有获得抗性的细胞存活。
与阳性筛选不同,阴性筛选是通过比较不同筛选时间点的sgRNA丰度差,获得缺失的sgRNA,分析其所对应的基因,从而找出可能影响细胞生长的候选基因。
MAGeCK is designed to identify positively and negatively selected sgRNAs and genes in genome-scale CRISPR/Cas9 knockout experiments.
Genes are ranked by the p.neg field (by default).
结果解读
Output file specification
https://sourceforge.net/p/mageck/wiki/output/#gene_summary_txt
两种不同的compare方法
mageck test -k all.2gRNA.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp mageck test -k all.2gRNA.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp mageck mle -k all.2gRNA.cas12.count.merge.txt -d D2_designmatrix.txt -n D2_mle_comp mageck mle -k all.2gRNA.cas12.count.merge.txt -d D7_designmatrix.txt -n D7_mle_comp
第一种rra就是肯定有两种结果,一是positive,一是negative;
第二种就是mle,只有一个beta,出来的火山图比较奇怪,可以对比一下;
参考:
- Crispr-cas9 screen analysis mageck
- 可以用其他sgrna counter,但我还是喜欢用成熟的mageck
- cas12 paper: Efficient gene knockout and genetic interactions: the IN4MER CRISPR/Cas12a multiplex knockout platform