CRISPR Screen | 基因编辑筛选靶点 | MAGeCKFlute

 

工具篇:MAGeCKFlute - Integrative analysis pipeline for pooled CRISPR functional genetic screens

安装折腾了几个小时,debug核心,根据关键提示Google,筛选找到答案即可。

error:

Error: package or namespace load failed for ‘depmap’:
.onLoad failed in loadNamespace() for 'depmap', details:
call: (function (cond)
error: error in evaluating the argument 'x' in selecting a method for function 'query': Failed to collect lazy table.
Caused by error in `db_collect()`:
! Arguments in `...` must be used.
✖ Problematic argument:
• ..1 = Inf
ℹ Did you misspell an argument name?
Error: loading failed
Execution halted
ERROR: loading failed


Solved:
downgrading dbplyr solves this issue

devtools::install_version("dbplyr", version = "2.3.4")

  

 

 


 

系统性的学习一下CRISPR screen的数据分析

前篇:fastq demultiplexing - 确实得到了比较靠谱的fastq文件

 

这次比较麻烦,需要从Undetermined的fastq开始,问题就是index是RC,当时没有考虑到。

自己做有点麻烦,如果直接用最短的index,会出现很多假阳性,所以必须extend。

index在read的header里,因为是dual index,所以需要准备index fastq,然后用工具提取即可,一晚上能搞完80G+80G的paired fastqs。 

 

fastq非常dirty,需要先trim一下,fastp一下,80G只剩下65G。【完全可以不做】下面的方法花里胡哨,其实没啥用。

Find a tar file in the above location. It has 12 samples from HEK293T and an excel with sample information.

I trimmed the reads using fastp to remove adapters, poly-G tails and low-quality bases. I used PEAR to merge mate pairs to make a single read based on default overlap parameters.

The fastq.gz files you have now, treat them as single end read files. I believe you have the inzolia guide file for annotation, I shared the simple csv for MAGECX in the same folder. Will keep HT29 data in the same folder once I am done preprocessing it.

 【可以略过,没必要做merge,知道有这么个工具就行】

其次就是merge,因为测序长度可能超过了insert size,所以需要寻找overlap,然后merge,这个PEAR可以做。【bbmerge.sh也能做,但比较旧】

R1 --------------------------->
                                       <------------------------- R2

但PEAR会剩下很多unassembled的reads,这时候就用BBtools合并一下,reformat.sh,就是粗暴的拼接。【reformat就是简单的cat,并不是RC后paste】

reformat.sh in1=sample_R1.fq.gz in2=sample_R2.fq.gz out=sample.fq.gz 

  

所以强制改了PEAR的参数,使用所有的reads,这样就不会有unassembled的reads了。

pear -n 0 -e -p 1.0 -v 0 -j 20 -f merged.fq.unassembled.forward.fastq -r merged.fq.unassembled.reverse.fastq -o unassembled

  

 


 

 

搞完之后,就可以跑MAGeCK的pipeline了。

实操分享-使用MAGeCK分析Bulk CRISPR Screen数据

conda create -c bioconda -n crispr_screen mageck
conda activate crispr_screen
conda update mageck

  

可以直接用paired fastq

mageck count -l library.txt -n prefix --sample-label A,B,C,D --fastq A_1.fq.gz B_1.fq.gz C_1.fq.gz D_1.fq.gz --fastq-2 A_2.fq.gz B_2.fq.gz C_2.fq.gz D_2.fq.gz

 

首先要准备cas12_library_all.csv文件

这次的cas12 lib比较特殊,是四个20 bp的sgRNA被拼接成了一个sgRNA,我试了用bowtie2和hisat2去比对到全长sgRNA,基本是0比对。

conda install bioinfo::hisat
conda install bioconda::hisat2

hisat2-build ../library_ref.fasta gRNA_ref

hisat2 -p 15 -t --very-sensitive \
    -x /home/zz950/projects/demultiplexing/demultiplexed/sgRNA_ref/gRNA_ref \
    -1 fastq/D7_1uM_1_1.fq \
    -2 fastq/D7_1uM_1_2.fq \
    -S D7_1uM_1.sam 2> D7_1uM_1.GenomeMapStat.xls


bowtie2-build ../bowtie2_ref.fasta gRNA_ref

bowtie2 -x /home/zz950/projects/demultiplexing/demultiplexed/sgRNA_ref_bowtie2/gRNA_ref -1 fastq/D7_1uM_1_1.fq -2 fastq/D7_1uM_1_2.fq -p 15 -S D7_1uM_1.sam

  

所以不得不拆分成四个sgRNA,然后继续用mageck来分析

mageck说是可以用fastq2,其实基本所有的信息都在fastq1里面,用了paired fastq反而会报错。

我们的样本

D7_1uM_1
D7_1uM_2
D7_DMSO_1
D7_DMSO_2
D2_DMSO_1
D2_DMSO_2
D2_0.1uM_1
D2_0.1uM_2
Plasmid

 

一行命令搞定:

mageck count -l ../cas12_library_all.csv -n results/all.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_1_1.fq D7_1uM_2_1.fq D7_DMSO_1_1.fq D7_DMSO_2_1.fq D2_DMSO_1_1.fq D2_DMSO_2_1.fq D2_0.1uM_1_1.fq D2_0.1uM_2_1.fq Plasmid_1.fq
# 用了会报错
--fastq-2 D7_1uM_1_2.fq D7_1uM_2_2.fq D7_DMSO_1_2.fq D7_DMSO_2_2.fq D2_DMSO_1_2.fq D2_DMSO_2_2.fq D2_0.1uM_1_2.fq D2_0.1uM_2_2.fq Plasmid_2.fq

  

对每个样本单独做也行

mageck count -l ../cas12_library_all.csv -n results/D7_1uM_1 --sample-label D7_1uM_1 --fastq D7_1uM_1_1.fq
mageck count -l ../cas12_library_all.csv -n results/D7_1uM_2 --sample-label D7_1uM_2 --fastq D7_1uM_2_1.fq --fastq-2 D7_1uM_2_2.fq 
mageck count -l ../cas12_library_all.csv -n results/D7_DMSO_1 --sample-label D7_DMSO_1 --fastq D7_DMSO_1_1.fq --fastq-2 D7_DMSO_1_2.fq 
mageck count -l ../cas12_library_all.csv -n results/D7_DMSO_2 --sample-label D7_DMSO_2 --fastq D7_DMSO_2_1.fq 
mageck count -l ../cas12_library_all.csv -n results/D2_DMSO_1 --sample-label D2_DMSO_1 --fastq D2_DMSO_1_1.fq --fastq-2 D2_DMSO_1_2.fq 
mageck count -l ../cas12_library_all.csv -n results/D2_DMSO_2 --sample-label D2_DMSO_2 --fastq D2_DMSO_2_1.fq
mageck count -l ../cas12_library_all.csv -n results/D2_0.1uM_1 --sample-label D2_0.1uM_1 --fastq D2_0.1uM_1_1.fq
mageck count -l ../cas12_library_all.csv -n results/D2_0.1uM_2 --sample-label D2_0.1uM_2 --fastq D2_0.1uM_2_1.fq 
mageck count -l ../cas12_library_all.csv -n results/Plasmid --sample-label Plasmid --fastq Plasmid_1.fq --fastq-2 Plasmid_2.fq 

  

OK,得到count matrix了,可以做下游分析了。

 

讨论之后,发现必须要用top2 gRNA作为construct,来计数。count也没有减少多少。

mageck count -l ../cas12_library_top2.csv -n results/all.top2.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_1_1.fq D7_1uM_2_1.fq D7_DMSO_1_1.fq D7_DMSO_2_1.fq D2_DMSO_1_1.fq D2_DMSO_2_1.fq D2_0.1uM_1_1.fq D2_0.1uM_2_1.fq Plasmid_1.fq

 

mageck test -k all.top2.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp
mageck test -k all.top2.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp

  

还有一种对比看时间效应:sample (D2 or D7) vs plasmid(Day 0)

但因为我们的plasmid数据不行,暂时就不对比了。

 

下游分析代码:projects/demultiplexing/demultiplexed/cas12_SOX9_dTAG_CRISPR_screen.ipynb 

 

从novogene得到了新的 demultiplexing 数据,对比了一下,数据量差不多,但是还是以他们为金标准,重新分析一遍吧,也挺快的。

 

 2057  zcat Inzolia_Plasmid/Inzolia_Plasmid_1.fq.gz Inzolia_Plasmid_Re/Inzolia_Plasmid_Re_1.fq.gz > Inzolia_Plasmid_1.fq
 2059  zcat Inzolia_Plasmid/Inzolia_Plasmid_2.fq.gz Inzolia_Plasmid_Re/Inzolia_Plasmid_Re_2.fq.gz > Inzolia_Plasmid_2.fq
 2061  zcat D7_DMSO_R2/D7_DMSO_R2_1.fq.gz D7_DMSO_R2_Re/D7_DMSO_R2_Re_1.fq.gz > D7_DMSO_R2_1.fq
 2062  zcat D7_DMSO_R2/D7_DMSO_R2_2.fq.gz D7_DMSO_R2_Re/D7_DMSO_R2_Re_2.fq.gz > D7_DMSO_R2_2.fq
 2064  zcat D2_DMSO_R1/D2_DMSO_R1_1.fq.gz D2_DMSO_R1_Re/D2_DMSO_R1_Re_1.fq.gz > D2_DMSO_R1_1.fq
 2065  zcat D2_DMSO_R1/D2_DMSO_R1_2.fq.gz D2_DMSO_R1_Re/D2_DMSO_R1_Re_2.fq.gz > D2_DMSO_R1_2.fq
 2066  zcat D7_1uM_R1/D7_1uM_R1_1.fq.gz D7_1uM_R1_Re/D7_1uM_R1_Re_1.fq.gz > D7_1uM_R1_1.fq
 2067  zcat D7_1uM_R1/D7_1uM_R1_2.fq.gz D7_1uM_R1_Re/D7_1uM_R1_Re_2.fq.gz > D7_1uM_R1_2.fq
 2068  zcat D2_100nM_R2/D2_100nM_R2_1.fq.gz D2_100nM_R2_Re/D2_100nM_R2_Re_1.fq.gz > D2_100nM_R2_1.fq
 2069  zcat D2_100nM_R2/D2_100nM_R2_2.fq.gz D2_100nM_R2_Re/D2_100nM_R2_Re_2.fq.gz > D2_100nM_R2_2.fq
 2070  zcat D7_DMSO_R1/D7_DMSO_R1_1.fq.gz D7_DMSO_R1_Re/D7_DMSO_R1_Re_1.fq.gz > D7_DMSO_R1_1.fq
 2071  zcat D7_DMSO_R1/D7_DMSO_R1_2.fq.gz D7_DMSO_R1_Re/D7_DMSO_R1_Re_2.fq.gz > D7_DMSO_R1_2.fq
 2072  zcat D2_DMSO_R2/D2_DMSO_R2_1.fq.gz D2_DMSO_R2_Re/D2_DMSO_R2_Re_1.fq.gz > D2_DMSO_R2_1.fq
 2073  zcat D2_DMSO_R2/D2_DMSO_R2_2.fq.gz D2_DMSO_R2_Re/D2_DMSO_R2_Re_2.fq.gz > D2_DMSO_R2_2.fq
 2074  zcat D7_1uM_R2/D7_1uM_R2_1.fq.gz D7_1uM_R2_Re/D7_1uM_R2_Re_1.fq.gz > D7_1uM_R2_1.fq
 2075  zcat D7_1uM_R2/D7_1uM_R2_2.fq.gz D7_1uM_R2_Re/D7_1uM_R2_Re_2.fq.gz > D7_1uM_R2_2.fq
 2076  zcat D2_100nM_R1/D2_100nM_R1_1.fq.gz D2_100nM_R1_Re/D2_100nM_R1_Re_1.fq.gz > D2_100nM_R1_1.fq
 2077  zcat D2_100nM_R1/D2_100nM_R1_2.fq.gz D2_100nM_R1_Re/D2_100nM_R1_Re_2.fq.gz > D2_100nM_R1_2.fq

  

mageck count -l ../../cas12_library_top2.csv -n results/all.2gRNA.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2 --fastq D7_1uM_R1_1.fq D7_1uM_R2_1.fq D7_DMSO_R1_1.fq D7_DMSO_R2_1.fq D2_DMSO_R1_1.fq D2_DMSO_R2_1.fq D2_100nM_R1_1.fq D2_100nM_R2_1.fq 
# --fastq-2 D7_1uM_R1_2.fq D7_1uM_R2_2.fq D7_DMSO_R1_2.fq D7_DMSO_R2_2.fq D2_DMSO_R1_2.fq D2_DMSO_R2_2.fq D2_100nM_R1_2.fq D2_100nM_R2_2.fq

mageck test -k all.2gRNA.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp
mageck test -k all.2gRNA.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp

  

mageck count -l ../../cas12_library_top2.csv -n results/all.2gRNA.cas12 --sample-label D7_1uM_1,D7_1uM_2,D7_DMSO_1,D7_DMSO_2,D2_DMSO_1,D2_DMSO_2,D2_0.1uM_1,D2_0.1uM_2,Plasmid --fastq D7_1uM_R1_1.fq D7_1uM_R2_1.fq D7_DMSO_R1_1.fq D7_DMSO_R2_1.fq D2_DMSO_R1_1.fq D2_DMSO_R2_1.fq D2_100nM_R1_1.fq D2_100nM_R2_1.fq Inzolia_Plasmid_1.fq
# --fastq-2 D7_1uM_R1_2.fq D7_1uM_R2_2.fq D7_DMSO_R1_2.fq D7_DMSO_R2_2.fq D2_DMSO_R1_2.fq D2_DMSO_R2_2.fq D2_100nM_R1_2.fq D2_100nM_R2_2.fq

  

 

CRISRP共同量测序分为阳性筛选和阴性筛选。

阳性筛选指施加一定的筛选压力,经文库扰动后野生型细胞致死,仅有获得抗性的细胞存活。

与阳性筛选不同,阴性筛选是通过比较不同筛选时间点的sgRNA丰度差,获得缺失的sgRNA,分析其所对应的基因,从而找出可能影响细胞生长的候选基因

MAGeCK is designed to identify positively and negatively selected sgRNAs and genes in genome-scale CRISPR/Cas9 knockout experiments.

Genes are ranked by the p.neg field (by default).

 

结果解读

Output file specification

https://sourceforge.net/p/mageck/wiki/output/#gene_summary_txt 

 

两种不同的compare方法

mageck test -k all.2gRNA.cas12.count.merge.txt -t D7_1uM -c D7_DMSO -n D7_comp
mageck test -k all.2gRNA.cas12.count.merge.txt -t D2_0.1uM -c D2_DMSO -n D2_comp


mageck mle -k all.2gRNA.cas12.count.merge.txt -d D2_designmatrix.txt -n D2_mle_comp
mageck mle -k all.2gRNA.cas12.count.merge.txt -d D7_designmatrix.txt -n D7_mle_comp

  

第一种rra就是肯定有两种结果,一是positive,一是negative;

第二种就是mle,只有一个beta,出来的火山图比较奇怪,可以对比一下;

 

参考:

  • Crispr-cas9 screen analysis mageck
  • 可以用其他sgrna counter,但我还是喜欢用成熟的mageck
  • cas12 paper: Efficient gene knockout and genetic interactions: the IN4MER CRISPR/Cas12a multiplex knockout platform

 

posted @ 2024-01-31 08:09  Life·Intelligence  阅读(169)  评论(0编辑  收藏  举报
TOP